Preface



The proceedings of ECML/PKDD 2003 are published in two volumes: the Pro- 
ceedings of the 14th European Conference on Machine Learning (LNAI 2837) 
and the Proceedings of the 7th European Conference on Principles and Practice 
of Knowledge Discovery in Databases (LNAI 2838). The two conferences were 
held on September 22-26, 2003 in Cavtat, a small tourist town in the vicinity of 
Dubrovnik, Croatia. 

As machine learning and knowledge discovery are two highly related fields, 
the co-location of both conferences is beneficial for both research communities. In 
Cavtat, ECML and PKDD were co-located for the third time in a row, following 
the successful co-location of the two European conferences in Freiburg (2001) 
and Helsinki (2002). The co-location of ECML 2003 and PKDD 2003 resulted in 
a joint program for the two conferences, including paper presentations, invited 
talks, tutorials, and workshops. 

Out of 332 submitted papers, 40 were accepted for publication in the 
ECML 2003 proceedings, and 40 were accepted for publication in the PKDD 2003 
proceedings. All the submitted papers were reviewed by three referees. In addi- 
tion to submitted papers, the conference program consisted of four invited talks, 
four tutorials, seven workshops, two tutorials combined with a workshop, and a 
discovery challenge. 

We wish to express our gratitude to 

— the authors of submitted papers, 

— the program committee members, for thorough and timely paper evaluation, 

— invited speakers Pieter Adriaans, Leo Breiman, Christos Faloutsos, and 
Donald B. Rubin, 

— tutorial and workshop chairs Stefan Kramer, Luis Torgo, and Luc Dehaspe, 

— local and technical organization committee members, 

— advisory board members Luc De Raedt, Tapio Elomaa, Peter Flach, Heikki 
Mannila, Arno Siebes, and Hannu Toivonen, 

— awards and grants committee members Dunja Mladenic, Rob Holte, and 
Michael May, 

— Richard van der Stadt for the development of CyberChair which was used 
to support the paper submission and evaluation process, 

— Alfred Hofmann of Springer-Verlag for co-operation in publishing the
proceedings, and finally

— we gratefully acknowledge the financial support of the Croatian Ministry 
of Science and Technology, Slovenian Ministry of Education, Science, and 
Sports, and the Knowledge Discovery Network of Excellence (KDNet). 
KDNet also sponsored the student grants and best paper awards, while 
Kluwer Academic Publishers (the Machine Learning Journal) awarded a 
prize for the best student paper. 

We hope and trust that the week in Cavtat in late September 2003 will be 
remembered as a fruitful, challenging, and enjoyable scientific and social event. 



June 2003 Nada Lavrac 

Dragan Gamberger 
Hendrik Blockeel 
Ljupco Todorovski 
From Knowledge-Based to Skill-Based Systems:
Sailing as a Machine Learning Challenge 



Pieter Adriaans 

FNWI / ILLC 
University of Amsterdam 
Plantage Muidergracht 24 
1018 TV Amsterdam 
The Netherlands 
pietera@science.uva.nl
http://tuning.wins.uva.nl/~pietera/ALS/



Abstract. This paper describes the Robosail project. It started in 1997
with the aim to build a self-learning autopilot for a single-handed sailing
yacht. The goal was to make an adaptive system that would help a
single-handed sailor to go faster on average in a race. Presently, after five years
of development and a number of sea trials, we have a commercial system 
available (www.robosail.com). It is a hybrid system using agent tech- 
nology, machine learning, data mining and rule-based reasoning. Apart 
from describing the system we try to generalize our findings, and argue 
that sailing is an interesting paradigm for a class of hybrid systems that 
one could call Skill-based Systems. 



1 Introduction 

Sailing is a difficult sport that requires a lot of training and expert knowledge 
[1],[9],[6]. Recently the co-operation of crews on a boat has been studied in the
domain of cognitive psychology [4]. In this paper we describe the Robosail system
that aims at the development of self-learning steering systems for racing yachts
[8]. We defend the view that this task is an example of what one could call
skill-based systems. The connection between verbal reports of experts performing
a certain task and the implementation of ML for those tasks is an interesting
emerging research domain [3], [2], [7]. The system was tested in several real-life
race events and is currently commercially available. 



2 The Task 

Modern single-handed sailing started its history with the organization of the 
first Observer Single-Handed Transatlantic Race (OSTAR) in 1960. Since that 
time the sport has known a tremendous development and is the source of many 
innovations in sailing. A single-handed skipper can only attend the helm for 
about 20% of his time. The rest is divided between boat-handling, navigation, 
preparing meals, doing repairs and sleeping. All single-handed races allow the
skippers to use some kind of autopilot. In its simplest form such an autopilot is 
attached to a flux-gate compass and it can only maintain a compass course. More 
sophisticated autopilots use a variety of sensors (wind, heel, global positioning 
system etc.) to steer the boat optimally. In all races the use of engines to propel 
the boat and of electrical winches to operate the sails is forbidden. All boat- 
handling except steering is to be done with manual power only. 

It is clear that a single-handed sailor will be less efficient than a full crew. 
Given that a single-handed yacht operates on autopilot for more than 80% of
the time, even a slightly more efficient autopilot would make a yacht more
competitive. In a transatlantic crossing a skipper will alter course maybe once or
twice a day, based on meteorological data and information from various other
sources such as the positions of the competitors. From an economic point of view
automating this task is not a top priority. It is the optimization of the
handling of the helm from second to second that offers the biggest opportunity 
for improvement. The task to be optimized is then: steer the ship as fast as 
possible in a certain direction and give the skipper optimal support in terms of 
advice on boat-handling, early warnings, alerts etc. 

3 Introduction 

Our initial approach to the limited task of maintaining the course of a vessel 
was to conceive it as a pure machine learning task. At any given moment the 
boat would be in a certain region of a complex state-space defined by the array 
of sensor inputs. There was a limited set of actions defined in terms of a force 
exercised on the rudder, and there was a reward defined in terms of the overall 
speed of the boat. Fairly soon it became clear that it was not possible to solve 
the problem in terms of simple optimization of a system in a state-space: 

— There is no neutral theory-free description of the system. A sailing yacht is a 
system that exists on the border between two media with strong non-linear 
behavior, wind and water. The interaction between these media and the 
boat should ideally be modelled in terms of complex differential equations. 
A finite set of sensors will never be able to give enough information to analyze 
the system in all of its relevant aspects. A careful selection of sensors given 
economic, energy management and other practical constraints is necessary.
In order to make this selection one needs a theory about what to measure. 

— Furthermore, given the complexity of the mathematical description, there is 
no guarantee that the system will know regions of relative stability in which 
it can be controlled efficiently. The only indication we have that efficient 
control is possible is the fact that human experts do the task well, and the
best guess as to which sensors to select is the informal judgement of experts on
the sort of information they need to perform the task. The array of sensors 
that ‘describes’ the system is in essence already anthropomorphic. 

— Establishing the correct granularity of the measurements is a problem. Wind 
and wave information typically comes with a frequency of at least 10 Hz.
But hidden in these signals are other concepts that exist only on a different
timescale, e.g. gusts (above 10 seconds), veering (10 minutes) and sea-state
(hours). A careful analysis of sensor information involved in sailing shows
that sensors and the concepts that can be measured with them cluster in
different time-frames (hundreds of seconds, minutes, hours). This is a strong
indication for a modular architecture. The fact that at each level decisions
of a different nature have to be taken strongly suggests an architecture that
consists of a hierarchy of agents that operate in different time-frames: lower
agents have a higher measurement granularity, higher agents a lower one. 

— Even when a careful selection of sensors is made and an adequate
agent-architecture is in place, the convergence of the learning algorithms is a
problem. Tabula rasa learning is impossible in the context of sailing. One has to
start with a rough rule-based system that operates the boat reasonably well 
and use ML techniques to optimize the system. 

In the end we developed a hybrid agent-based system. It merges traditional AI
techniques like rule-based reasoning with more recent methods developed in the
ML community. Essential for this kind of system is the link between expert
concepts that have a fuzzy nature and learning algorithms. A simple example of 
an expert rule in the Robosail system is: If you sail close-hauled then luff in a 
gust. This rule contains the concepts ‘close-hauled’, ‘gust’ and ‘luff’. The system 
contains agents that represent these concepts: 

— Course agent: If the apparent wind angle is between A and B then you sail
close-hauled

— Gust agent: If the average apparent wind increases by a factor D for more than
E seconds then there is a gust

— Luff agent: Steer Z degrees windward. 

The related learning methodology then is: 

— Task: Learn optimal values for A,B,C,D,E,Z 

— Start with expert estimates then 

— Optimize using ML techniques 
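
The three agents and the learning task above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the Robosail implementation: the numeric starting values, the handling of the sampling rate, the `reward` callback and the random hill-climbing step are hypothetical stand-ins for the expert estimates and the ML techniques, which the text does not specify. The parameter C is listed in the task but is not used by any of the three rules quoted, so it is omitted here.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Params:
    # Hypothetical "expert estimates" used as the starting point.
    A: float = 25.0  # lower bound of the close-hauled apparent wind angle (degrees)
    B: float = 45.0  # upper bound of the close-hauled apparent wind angle (degrees)
    D: float = 1.3   # factor by which the wind must exceed its earlier average
    E: float = 3.0   # seconds the increase must last to count as a gust
    Z: float = 4.0   # degrees to steer to windward when luffing


def is_close_hauled(awa_deg, p):
    """Course agent: apparent wind angle between A and B."""
    return p.A <= abs(awa_deg) <= p.B


def is_gust(speeds, sample_hz, p):
    """Gust agent: the last E seconds all exceed D times the earlier average."""
    n = max(1, int(round(p.E * sample_hz)))
    if len(speeds) <= n:
        return False  # not enough history to form a baseline
    earlier = speeds[:-n]
    baseline = sum(earlier) / len(earlier)
    return all(s > p.D * baseline for s in speeds[-n:])


def helm_correction(awa_deg, speeds, sample_hz, p):
    """Expert rule: if you sail close-hauled then luff in a gust."""
    if is_close_hauled(awa_deg, p) and is_gust(speeds, sample_hz, p):
        return p.Z  # Luff agent: steer Z degrees to windward
    return 0.0


def tune(p, reward, steps=200, scale=0.05, seed=0):
    """Random hill climbing from the expert estimates (an assumed optimizer)."""
    rng = random.Random(seed)
    best, best_reward = p, reward(p)
    for _ in range(steps):
        name = rng.choice(["A", "B", "D", "E", "Z"])
        value = getattr(best, name) * (1.0 + rng.uniform(-scale, scale))
        candidate = replace(best, **{name: value})
        r = reward(candidate)
        if r > best_reward:  # keep only improvements
            best, best_reward = candidate, r
    return best
```

In practice `reward` would be derived from logged or simulated boat speed; here it is left as a callback so that the sketch stays independent of any particular data source.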

This form of symbol grounding is an emerging area of research interest that seems 
to be of vital importance to the kind of skill-based systems like Robosail [8], [2],
[7], [5]. 

4 The System 

The main system contains four agents: Skipper, Navigator, Watchman and
Helmsman. These roles are more or less modelled after the task division on
a modern racing yacht [9]. Each agent lives in a different time frame, and the
agents are ordered in a subsumption hierarchy. The skipper is intended to take
strategic decisions with a time interval of, say, 3 to 6 hours. He has to take into
account weather patterns, currents, sea state, chart info etc. Currently this process is only
partly automated. It results in the determination of a waypoint, i.e. a location
on the map where we want to arrive as soon as possible. The navigator and the
watchman have the responsibility to get to the waypoint. The navigator deals
with the more tactical aspects of this process. He knows the so-called polar
diagrams of the boat and its behavior in various sea states. He also has a number
of agents at his disposal that help him to assess the state of the ship: do we carry
too much sail, is there too much current, is our trim correct etc. The reasoning 
of the navigator results in a compass course. This course could change within 
minutes. The watchman is responsible for keeping this course with optimal ve- 
locity in the constantly changing environment (waves, wind shifts etc.). He gives 
commands to the helmsman, whose only responsibility it is to make and execute 
plans to get and keep the rudder in certain positions in time. 

[Figure: "The AI solution: a hybrid agent-based approach". Each level of the
hierarchy processes an input into an output that is passed to the level below.]

Input                                     Output

Weather maps, electronic charts,          Goal: Waypoint
tidal info

Goal: Waypoint                            Goal: Compass Course
Info: COG, position, tidal info,          Info: VMG or Course
polars, variation, deviation

Goal: Compass Course                      Action: Rudder(Delta, Time)
Info: app. wind speed/angle,              Info: trim, sea state
heel, speed, sail trim, polar

Action: Rudder(Delta, Time)               Direction(L,R)
Info: rudder angle, speed,                Force [0,1]
trim, sea state, heel

Fig. 1. The hierarchy of main agents



There are a number of core variables: log speed, apparent wind speed and 
angle, rudder angle, compass course, current position, course on ground and 
speed on ground. These are loaded into the kernel system. Apart from these core 
variables there are a number of other sensors that give information, amongst
others: canting angle mast, swivel angle keel, heel sideways, heel fore-aft, depth,
sea state, wave direction, acceleration in various directions. Some of these activate
agents that warn of certain undesirable situations (e.g. depth, temperature of
the water). Others are for the moment only used for human inspection (e.g.
radar images). For each sensor we have to consider the balance between the 
contribution to speed and safety of the boat and the negative aspects like energy 
consumption, weight, increased complexity of the system. 
[Figure: Hybrid Architecture]

Fig. 2. The main architecture



The final system is a complex interplay between sensor-, agent- and network 
technology, machine learning and AI techniques brought together in a hybrid 
architecture. The hardware (CE radiation level requirements, water and shock 
proof, easy to mount and maintain) consists of: 

— CAN bus architecture: Guaranteed delivery

— Odys Intelligent rudder control unit (IRCU): 20 kHz, max. 100 Amp
(extensive functions for self-diagnosis)

— Thetys Solid state digital motion sensor and compass

— Multifunction display 

— Standard third party sensors with NMEA interface (e.g. B&G) 

The software functionality involves: 

— Agent based architecture 

— Subsumption architecture 

— Model builder: on line visual programming 

— Real Time flow charting 

— Relational database with third party datamining facility 

— Web enabling 

— Remote control and reporting 

Machine Learning and AI techniques that are used: 

— Watchman: Case Based Reasoning

— Helmsman: neural network on top of PID controller

— Advisor: nearest-neighbor search

— Agents and virtual sensors for symbol grounding

— Data-explorer with machine learning suite 

— Waverider: 30 dimensional ARMA model 

— Off line KDD effort: rule induction on the basis of fuzzy expert concepts 
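
The list above names a PID controller as the base layer of the Helmsman but gives no further detail. As a reminder of what such a layer does, here is a minimal textbook PID sketch in Python. The gains, the time step and the `heading_error` helper are illustrative assumptions, and the neural network that the text places on top of the controller is not shown.

```python
class PID:
    """Textbook PID controller; the gains kp, ki, kd are hypothetical."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        """Return the control output for the current error and time step dt."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no previous sample yet
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def heading_error(target_deg, actual_deg):
    """Signed course error wrapped to the interval [-180, 180) degrees."""
    return (target_deg - actual_deg + 180.0) % 360.0 - 180.0
```

A caller would feed `heading_error(course, compass)` into `step` at each control cycle and map the result to a rudder command; a learned component could then adjust the gains or add a correction term.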

Several prototypes of the Robosail system have been tested over the years: the
first version in the Single Handed Transatlantic in 2000; a second prototype was
evaluated on board the Kingfisher during a trip from Brazil to the UK. A final
evaluation was done on board the Syllogic Sailing Lab during the Dual Round
Britain and Ireland in 2002. In 2003 the first commercial version is available.

5 Lessons Learned 

The Robosail application is a hybrid system that can be placed somewhere be- 
tween pure rule-based systems and pure machine learning systems. The nature of 
these systems raises some interesting philosophical issues concerning the nature 
of rules and their linguistic representations. In the course of history people have 
discovered that certain systems can be built and controlled, without really un- 
derstanding why this is the case. A sailing boat is such a system. It is what it is 
because of ill-understood hydro- and aerodynamical principles and has a certain 
form because the human body has to interact with it. It is thoroughly an an- 
thropomorphic machine. Human beings can handle these systems, because they 
are the result of a long evolutionary process. Their senses are adapted to those 
regions of reality that are relatively stable and are sensitive to exactly those 
phase changes that give relevant information about the state of the systems. 
In a process of co-evolution the language to communicate about these concepts 
emerged. Specific concepts like ‘wave’, ‘gust’ and ‘veering’ exist because they 
mark relevant changes of the system. Their cognitive status however is complex, 
and it appears to be non-trivial to develop automated systems that discover 
these concepts on the basis of sensor data. 

A deeper discussion of these issues would have to incorporate an analysis of 
the nature of rules that is beyond the scope of this paper. The rules of a game 
like chess exist independently of their verbal representation. We use the verbal 
representation to communicate with others about the game and to train young 
players. A useful distinction is the one between constitutive rules and regulative 
rules. The constitutive rules define the game. If they are broken the game stops. 
An example for chess would be: You may not move a piece to a square already 
occupied by one of your own pieces. Regulative rules define good strategies for 
the game. If you break them you diminish your chances of winning, but the 
game does not stop. An example of a regulative rule for chess would be: When 
you are considering giving up some of your pieces for some of your opponent’s, 
you should think about the values of the men, and not just how many each player 
possesses. Regulative rules represent the experience of expert players. They have 
a certain fuzziness and it is difficult to implement them in pure knowledge-based
systems. The only way we can communicate about skills is in terms of regulative
rules. The rule If you sail close-hauled then luff in a gust is an example. Verbal
reports of experts in terms of regulative rules can play an important role in the
design of systems. From a formal point of view they reduce the complexity of the 
task. They tell us where to look in the state space of the system. From a cognitive 
point of view they play a similar role in teaching skills. They tell the student 
roughly what to do. The fine tuning of the skill is then a matter of training. 
Fig. 3. A taxonomy of systems



This discussion suggests that we can classify tasks in two dimensions: 1) The 
expert dimension: do human agents perform well on the task and can they report
verbally on their actions? and 2) The formal dimension: do we have adequate
formal models of the task that allow us to perform tests in silico? For chess and
a number of other tasks that were analyzed in the early stages of AI research the 
answer to both questions is yes. Operations research studies systems for which 
the first answer is no and the second answer is yes. For sailing the answer to 
the first question is positive, the answer to the second question negative. This 
is typical for skill-based systems. This situation has a number of interesting 
methodological consequences: we need to incorporate the knowledge of human 
experts into our system, but this knowledge in itself is fundamentally incomplete 
and needs to be embedded in an adaptive environment. Naturally this leads to 
issues concerning symbol grounding, modelling human judgements, hybrid ar- 
chitectures and many other fundamental questions relevant for the construction 
of ML applications in this domain. 

A simple sketch of a methodology to develop skill-based systems would be: 

— Select sensor type and range based on expert input 

— Develop partial model based on expert terminology 
— Create agents that emulate expert judgements

— Refine model using machine learning techniques 

— Evaluate model with expert 



6 Conclusion and Further Research 

In this paper we have sketched our experiences creating an integrated system 
for steering a sailing yacht. The value of such practical projects can hardly be 
overestimated. Building real life systems is 80% engineering and 20% science. 
One of the insights we developed is the notion of the existence of a special class 
of skill-based systems. Issues in constructing these systems are: the need for a 
hybrid architecture, the interplay between discursive rules (expert system, rule
induction) and senso-motoric skills (PID controllers, neural networks), a learning
approach, agent technology, the importance of semantics and symbol grounding 
and the importance of jargon. The nature of skill-based systems raises interesting 
philosophical issues concerning the nature of rules and their verbal representa- 
tions. 

In the near future we intend to develop more advanced systems. The current 
autopilot is optimized to sail as fast as possible from A to B. A next generation 
would also address tactical and strategic tasks: tactical, win the race (modelling
your opponents); strategic, bring the crew safely to the other side of the ocean.
Other interesting ambitions are: the construction of better autopilots for
multihulls, the design of an ultra-safe autonomous cruising yacht, establishing an
official speed record for autonomous sailing yachts and deploying the Robosail
technology in other areas like the automotive and aviation industries.
Two-eyed algorithms are complex prediction algorithms that give accurate
predictions and also give important insights into the structure of the data the
algorithm is processing. The main example I discuss is RF/tools, a collection of
algorithms for classification, regression and multiple dependent outputs. The last 
algorithm is a preliminary version and further progress depends on solving some 
fascinating questions of the characterization of dependency between variables. 

An important and intriguing aspect of the classification version of RF/tools
is that it can be used to analyze unsupervised data, that is, data without class
labels. This conversion leads to such by-products as clustering, outlier detection, 
and replacement of missing data for unsupervised data. 

The talk will present numerous results on real data sets. The code (f77) and
ample documentation for RF/tools are available on the web site
www.stat.berkeley.edu/RFtools.
Abstract. What patterns can we find in bursty web traffic? On the
web or internet graph itself? How about the distributions of galaxies in 
the sky, or the distribution of a company’s customers in geographical 
space? How long should we expect a nearest-neighbor search to take, 
when there are 100 attributes per patient or customer record? The traditional
assumptions (uniformity, independence, Poisson arrivals, Gaussian
distributions) often fail miserably. Should we give up trying to find
patterns in such settings? 

Self-similarity, fractals and power laws are extremely successful in describing
real datasets (coast-lines, river basins, stock-prices, brain-surfaces,
communication-line noise, to name a few). We show some old
and new successes, involving modeling of graph topologies (internet, web 
and social networks); modeling galaxy and video data; dimensionality re- 
duction; and more. 



Introduction — Problem Definition 

The goal of data mining is to find patterns; we typically look for the Gaussian 
patterns that appear often in practice and on which we have all been trained 
so well. However, here we show that these time-honored concepts (Gaussian,
Poisson, uniformity, independence) often fail to model real distributions well.
Furthermore, we show how to fill the gap with the lesser-known, but even more
powerful tools of self-similarity and power laws. 

We focus on the following applications: 

— Given a cloud of points, what patterns can we find in it? 

— Given a time sequence, what patterns can we find? How to characterize and 
anticipate its bursts? 

— Given a graph (e.g., social, or computer network), what does it look like?
Which is the most important node? Which nodes should we immunize first, 
to guard against biological or computer viruses? 

All three settings appear extremely often, with vital applications. Clouds of
points appear in traditional relational databases, where records with k attributes
become points in k-d spaces, e.g. a relation with patient data (age, blood
pressure, etc.); in geographical information systems (GIS), where points can be, e.g.,
cities on a two-dimensional map; in medical image databases with, for example,
three-dimensional brain scans, where we want to find patterns in the brain ac- 
tivation [ACF+93]; in multimedia databases, where objects can be represented 
as points in feature space [FRM94]. In all these settings, the distribution of k- 
d points is seldom (if ever) uniform [Chr84], [FK94]. Thus, it is important to 
characterize the deviation from uniformity in a succinct way (e.g. as a sum of 
Gaussians, or something even more suitable). Such a description is vital for data 
mining [AIS93],[AS94], for hypothesis testing and rule discovery. A succinct de- 
scription of a k-d point-set could help reject quickly some false hypotheses, or 
could help provide hints about hidden rules. 

A second, very popular class of applications is time sequences. Time se- 
quences appear extremely often, with a huge literature on linear [BJR94], and 
non-linear forecasting [CE92], and the recent surge of interest in sensor data
[OJW03] [PBF03] [GGR02].

Finally, graphs, networks and their surprising regularities/laws have been 
attracting significant interest recently. The applications are diverse, and the dis- 
coveries are striking. The World Wide Web is probably the most impressive 
graph, which motivated significant discoveries: the famous Kleinberg algorithm 
[Kle99] and the closely related PageRank algorithm of Google fame [BP98]; the
fact that it obeys a “bow-tie” structure [BKM+00], while still having a
surprisingly small diameter [AJB99]. Similar startling discoveries have been made
in parallel for power laws in the Internet topology [FFF99], for Peer-to-Peer
(gnutella/Kazaa) overlay graphs [RFI02], and for who-trusts-whom in the
epinions.com network [RD02]. Finding patterns, laws and regularities in large real
networks has numerous applications, exactly because graphs are so general and
ubiquitous: link analysis, for criminology and law enforcement [GSH+03];
analysis of virus propagation patterns, on both social/e-mail as well as
physical-contact networks [WKE00]; networks of regulatory genes; networks of
interacting proteins [Bar02]; food webs, to help us understand the importance of an
endangered species. 

We show that the theory of fractals provides powerful tools to solve the above
problems. 



Definitions 

Intuitively, a set of points is a fractal if it exhibits self-similarity over all scales. 
This is illustrated by an example: Figure 1(a) shows the first few steps in
constructing the so-called Sierpinski triangle. Figure 1(b) gives 5,000 points that
belong to this triangle. Theoretically, the Sierpinski triangle is derived from an
equilateral triangle ABC by excluding its middle (triangle A’B’C’) and by
recursively repeating this procedure for each of the resulting smaller triangles. The
resulting set of points exhibits ‘holes’ in any scale; moreover, each smaller
triangle is a miniature replica of the whole triangle. In general, the characteristic of
fractals is this self-similarity property: parts of the fractal are similar (exactly
or statistically) to the whole fractal. For our experiments we use 5,000 sample
points from the Sierpinski triangle, using Barnsley’s algorithm of Iterated
Function Systems [BS88] to generate these points quickly.
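
For readers who want to reproduce such a sample, the random-iteration form of an Iterated Function System (often called the chaos game) can be sketched in a few lines of Python. This is a minimal sketch, not the code used for the experiments: the vertex coordinates, the starting point, the `burn_in` length and the function name are assumptions chosen for illustration.

```python
import random


def sierpinski_points(n, seed=0, burn_in=20):
    """Sample n points near the Sierpinski triangle by random iteration.

    Each step picks one of the three contractions "move halfway towards
    vertex v" at random; the iterates converge quickly to the attractor.
    """
    rng = random.Random(seed)
    vertices = [(0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2)]
    x, y = 0.25, 0.25  # arbitrary start inside the triangle
    points = []
    for i in range(n + burn_in):
        vx, vy = rng.choice(vertices)
        x, y = (x + vx) / 2, (y + vy) / 2
        if i >= burn_in:  # discard the first few iterates
            points.append((x, y))
    return points
```

Calling `sierpinski_points(5000)` yields a sample of the same size as the one described in the text; plotting the points reproduces the familiar pattern of nested holes.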




(a) construction (b) a finite sample 

Fig. 1. Theoretical fractals: the Sierpinski triangle (a) the first 3 steps of its recursive 
construction (b) a finite sample of it (5K points) 



Notice that the resulting point set is neither a 1-dimensional Euclidean object 
(it has infinite length), nor 2-dimensional (it has zero area). The solution is to 
consider fractional dimensionalities, which are called fractal dimensions. Among 
the many definitions, we describe the correlation fractal dimension, D, because 
it is the easiest to describe and to use. 

Let $nb(\epsilon)$ be the average number of neighbors of an arbitrary point, within
distance $\epsilon$ or less. For a real, finite cloud of E-dimensional points, we follow
[Sch91] and say that this data set is self-similar in the range of scales $r_1, r_2$ if

$$nb(\epsilon) \propto \epsilon^{D}, \qquad r_1 \le \epsilon \le r_2 \qquad (1)$$

The correlation integral is defined as the plot of $nb(\epsilon)$ versus $\epsilon$ in log-log scales;
for self-similar datasets, it is linear with slope D.
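
The definition translates directly into a small estimator: count neighbors at several radii and fit the slope of the log-log plot. The Python sketch below does this by brute force for two-dimensional points. It is an illustrative sketch only: the function names and the least-squares fit are assumptions, the pair counting is quadratic in the number of points, and the radii are assumed to lie inside the self-similar range and to be large enough that every radius captures at least one pair.

```python
import math


def correlation_integral(points, radii):
    """Average number of neighbors within each radius (brute force, O(n^2))."""
    n = len(points)
    counts = [0] * len(radii)
    for i in range(n):
        xi, yi = points[i]
        for j in range(i + 1, n):
            d = math.hypot(xi - points[j][0], yi - points[j][1])
            for k, r in enumerate(radii):
                if d <= r:
                    counts[k] += 2  # the pair counts once for each endpoint
    return [c / n for c in counts]


def loglog_slope(xs, ys):
    """Least-squares slope of log(ys) against log(xs)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return sxy / sxx


def correlation_dimension(points, radii):
    """Estimate D as the slope of the correlation integral in log-log scales."""
    return loglog_slope(radii, correlation_integral(points, radii))
```

Applied to evenly spaced points on a line segment the estimate comes out close to 1, as expected for a Euclidean curve. For a sample of the Sierpinski triangle the theoretical value is log 3 / log 2, roughly 1.58, and a finite sample should give an estimate in that neighborhood.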

Notice that the above definition of fractal dimension D encompasses the 
traditional Euclidean objects: lines, line segments, circles, and all the standard 
curves have D = 1; planes, disks and standard surfaces have D = 2; Euclidean
volumes in E-dimensional space have D = E.

Discussion — How Frequent Are Self-similar Datasets? 

The reader might be wondering whether any real datasets behave like
fractals, with linear correlation integrals. Numerous real datasets give linear
correlation integrals, including longitude-latitude coordinates of stars in the
sky, population-versus-area of the countries of the world [FK94]; several
geographic datasets [BF95] [FK94]; medical datasets [FG96]; automobile-part shape
datasets [BBB+97,BBKK97].

There is overwhelming evidence from multiple disciplines that fractal datasets 
appear surprisingly often [Man77](p. 447),[Sch91]: 
— coast lines and country borders (D ≈ 1.1-1.3);

— the periphery of clouds and rainfall patches (D ≈ 1.35) [Sch91](p. 231);

— the distribution of galaxies in the universe (D ≈ 1.23);

— stock prices and random walks (D = 1.5);

— the brain surface of mammals (D ≈ 2.7);

— the human vascular system (D = 3, because it has to reach every cell in the
body!);

— even traditional Euclidean objects have linear box-counting plots, with
integer slopes.

Discussion — Power Laws 

Self-similarity and power laws are closely related. A power law is a law of the 
form 

$$y = f(x) = x^{a} \qquad (2)$$

Power laws are the only laws that have no characteristic scales, in the sense that
they remain power laws, even if we change the scale: $f(c \cdot x) = c^{a} \cdot x^{a}$.

Exactly for this reason, power laws and self-similarity appear often together: 
if a cloud of points is self similar, it has no characteristic scales; any law/pattern 
it obeys, should also have no characteristic scale, and it should thus be a power 
law. 
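
The scale-invariance argument can be written out in a few lines. The derivation below is a sketch in LaTeX notation, using the exponent a of equation (2) and an arbitrary scale factor c > 0. The exponential function g with scale parameter s is added only as an assumed contrast case and does not appear in the original text.

```latex
\begin{align*}
f(x) &= x^{a} \\
f(c\,x) &= (c\,x)^{a} = c^{a}\,x^{a} = c^{a} f(x) \\
\frac{f(c\,x)}{f(x)} &= c^{a}
  \quad \text{(independent of } x\text{)} \\
\log f(x) &= a \log x
  \quad \text{(a straight line of slope } a \text{ in log-log scales)} \\[6pt]
g(x) &= e^{-x/s}
  \;\Longrightarrow\;
  \frac{g(c\,x)}{g(x)} = e^{-(c-1)\,x/s}
  \quad \text{(depends on } x\text{, so } s \text{ is a characteristic scale)}
\end{align*}
```

The fourth line is the reason power laws show up as straight lines in the log-log plots used throughout this abstract.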

Power laws also appear extremely often, in diverse settings: in text, with the 
famous Zipf law [Zip49]; in distributions of income (the Pareto law); in scientific 
citation analysis (Lotka law); in distribution of areas of lakes, islands and animal
habitats (Korcak’s law [Sch91,HS93,PF01]); in earthquake analysis
(Gutenberg-Richter law [Bak96]); in LAN traffic [LTWW94]; in web click-streams [MF01];
and countless more settings.

Conclusions 

Self-similarity and power laws can solve data mining problems that traditional
methods cannot. The two major tools that we cover in the talk are: (a) the
“correlation integral” [Sch91] for a set of points and (b) the “rank-frequency” plot
[Zip49] for categorical data. The former can estimate the intrinsic dimensionality
of a cloud of points, and it can help with dimensionality reduction [TTWF00],
axis scaling [WF02], and separability [TTPF01]. The rank-frequency plot can
spot power laws, such as Zipf’s law, and many more.
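As an illustration of the rank-frequency plot, the sketch below ranks categorical items by frequency and fits a straight line in log-log space; a roughly constant slope suggests a power law. The function names and the toy data are assumptions made for this example, and a plain least-squares fit is only one of several ways to estimate the slope.

```python
import math
from collections import Counter

def rank_frequency(items):
    """Return (rank, frequency) pairs, most frequent item first."""
    counts = sorted(Counter(items).values(), reverse=True)
    return list(enumerate(counts, start=1))

def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r, _ in points]
    ys = [math.log(f) for _, f in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Toy data (assumption): item k appears 120 // k times, an exact Zipf-like shape.
items = [k for k in range(1, 7) for _ in range(120 // k)]
slope = loglog_slope(rank_frequency(items))
```

On this toy input the frequencies are exactly proportional to 1/rank, so the fitted slope is -1 up to rounding.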
Propensity score methods were proposed by Rosenbaum and Rubin (1983, Biometrika)
as central tools to help assess the causal effects of interventions. Since
their introduction two decades ago, they have found wide application in a variety 
of areas, including medical research, economics, epidemiology, and education, 
especially in those situations where randomized experiments are either difficult 
to perform, or raise ethical questions, or would require extensive delays before 
answers could be obtained. Rubin (1997, Annals of Internal Medicine) provides 
an introduction to some of the essential ideas. In the past few years, the number 
of published applications using propensity score methods to evaluate medical and 
epidemiological interventions has increased dramatically. Rubin (2003, Erlbaum) 
provides a summary, which is already out of date. 

Nevertheless, thus far, there have been few applications of propensity score 
methods to evaluate marketing interventions (e.g., advertising, promotions), 
where the tradition is to use generally inappropriate techniques, which focus
on the prediction of an outcome from an indicator for the intervention and 
background characteristics (such as least-squares regression, data mining, etc.). 
With these techniques, an estimated parameter in the model is used to esti- 
mate some global “causal” effect. This practice can generate grossly incorrect 
answers that can be self-perpetuating: polishing the Ferraris rather than the 
Jeeps “causes” them to continue to win more races than the Jeeps; visiting
the high-prescribing doctors rather than the low-prescribing doctors “causes” 
them to continue to write more prescriptions. 

This presentation will take “causality” seriously, not just as a casual con- 
cept implying some predictive association in a data set, and will show why 
propensity score methods are superior in practice to the standard predictive ap- 
proaches for estimating causal effects. The results of our approach are estimates 
of individual-level causal effects, which can be used as building blocks for more
complex components, such as response curves. We will also show how the stan- 
dard predictive approaches can have important supplemental roles to play, both 
for refining estimates of individual-level causal effects and for assessing
how these causal effects might vary as a function of background information,
both important uses for situations when targeting an audience and/or allocating 
resources are critical objectives. 

The first step in a propensity score analysis is to estimate the individual 
scores, and there are various ways to do this in practice, the most common
being logistic regression. However, other techniques, such as probit regression
or discriminant analysis are also possible, as are the robust methods based on 
the t-family of long tailed distributions. Other possible methods include highly 
non-linear methods such as CART or neural nets. A critical feature of esti- 
mating propensity scores is that diagnosing the adequacy of the resulting fit is 
very straightforward, and in fact guides what the next steps in a full propen- 
sity score analysis should be. This diagnosing takes place without access to the 
outcome variables (e.g., sales, number of prescriptions) so that the objectivity
of the analysis is maintained. In some cases, the conclusion of the diagnostic 
phase must be that inferring causality from the data set at hand is impossible 
without relying on heroic and implausible assumptions, and this can be very 
valuable information, information that is not directly available from traditional 
approaches. 
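The first step and its outcome-free diagnostic can be sketched as follows. This is a hedged illustration rather than the author's procedure: the data are simulated, the one-covariate logistic model is fitted by plain gradient ascent, and the diagnostic shown (comparing covariate means between treated and untreated units within strata of the estimated score) is one common balance check that is assumed here, since the abstract does not spell out the diagnostic. All names (`fit_logistic`, `mean_gap`, `xs`, `ts`) are hypothetical.

```python
import math
import random

def fit_logistic(xs, ts, steps=500, lr=0.5):
    """Fit P(t=1 | x) = sigmoid(b0 + b1*x) by plain gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, t in zip(xs, ts):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += t - p
            g1 += (t - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

random.seed(0)
# Hypothetical data: x is a background covariate, t marks who got the intervention.
xs = [random.gauss(0.0, 1.0) for _ in range(400)]
ts = [1 if random.random() < 1.0 / (1.0 + math.exp(-1.5 * x)) else 0 for x in xs]

b0, b1 = fit_logistic(xs, ts)
scores = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]

def mean_gap(rows):
    """Treated-minus-untreated difference in mean covariate, or None if a group is empty."""
    treated = [x for x, t in rows if t == 1]
    control = [x for x, t in rows if t == 0]
    if not treated or not control:
        return None
    return sum(treated) / len(treated) - sum(control) / len(control)

# Diagnostic that never touches an outcome variable:
# covariate balance overall versus within five strata of the estimated score.
overall_gap = mean_gap(list(zip(xs, ts)))
ranked = sorted(zip(scores, xs, ts))
strata = [ranked[i:i + 80] for i in range(0, 400, 80)]
stratum_gaps = [mean_gap([(x, t) for _, x, t in s]) for s in strata]
```

If the estimated scores are adequate, the entries of `stratum_gaps` should be much closer to zero than `overall_gap`; large remaining gaps would signal that the score model needs refinement before any outcome is examined.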

Marketing applications from the practice of AnaBus, Inc. will also be pre- 
sented. AnaBus currently has a Small Business Innovative Research Grant from 
the US NIH to implement essential software to allow the implementation of the 
full propensity score approach to estimating the effects of interventions. Other 
examples will also be presented if time permits, for instance, an application from 
the current litigation in the US on the effects of cigarette smoking (Rubin, 2002, 
Health Services Outcomes Research). 

An extensive reference list from the author is included. These references are 
divided into five categories. First, general articles on inference for causal effects 
not having a focus on matching or propensity scores. Second, articles that focus 
on matching methods before the formulation of propensity score methods - some 
of these would now be characterized as examples of propensity score matching. 
Third, articles that address propensity score methods explicitly, either theoreti- 
cally or through applications. Fourth, articles that document, by analysis and/or 
by simulation, the superiority of propensity-based methods, especially when
used in combination with model-based adjustments, over model-based methods 
alone. And fifth, introductions and reviews of propensity score methods. The 
easiest place for a reader to start is with the last collection of articles. 

Such a reference list is obviously very idiosyncratic and is not meant to imply 
that only the author has done good work in this area. Paul Rosenbaum, for 
example, has been an extremely active and creative contributor for many years, 
and his text book “Observational Studies” is truly excellent. As another example, 
Rajeev Dehejia and Sadek Wahba’s 1999 article in the Journal of the American
Statistical Association has been very influential, especially in economics.

Abstract. Association mining is the comprehensive identification of frequent
patterns in discrete tabular data. The result of association mining
can be a listing of hundreds to millions of patterns, of which few are
likely of interest. In this paper we present a probabilistic metric to filter
association rules that can help highlight the important structure in the
data. The proposed filtering technique can be combined with maximal
association mining algorithms or heuristic association mining algorithms
to more efficiently search for interesting association rules with lower support.



1 Introduction 

Association mining is the process of identifying frequent patterns in a tabular 
dataset, usually requiring some minimum support, or frequency of the pattern in 
the data [2]. The discovery of frequent patterns in the data is usually followed by
the construction of association rules, which portray the patterns as predictive
relationships between particular attribute values. Unfortunately, since association
mining is an exhaustive approach, it is possible to generate many more patterns 
than a user can reasonably evaluate. Furthermore, many of these patterns may 
be redundant. Thus, it is important to develop informed and efficient pruning
systems for association mining rules. 

A number of methods to filter association mining results have been published, 
and they can be classified along several lines. First, the filtering can be objective 
or subjective [11]. Our goal is to design an objective, or purely computational, 
filter, both for inter-discipline generality and to avoid the bias introduced by 
subjective evaluation. In a separate categorization, rule filtering can be done 
on a rule-by-rule basis [7], or in an incremental manner where rules deemed
interesting are gradually added to a rule list set [6] or probability model [12]. We
focus our attention on the rule-by-rule approach. This approach is appropriate in 
application areas where dense data tables lead to the generation of millions of as- 
sociation rules, a set too large to use directly in incremental filtering algorithms. 
As an added advantage, filtering rules independently allows the straightforward 
use of batch parallelization of the filtering to produce linear speed-ups.

2 Background

It has long been understood in statistics that while not every significant feature 
of a dataset is interesting, an interesting feature must be statistically significant. 
Non-significant results, by definition, are those features that can be explained 
as a random effect, and therefore not worthy of further study. For example, an 
association rule predicting a customer action with 100% accuracy would not 
be significant if it only covers 2 customers in a large database, and so would 
not be interesting. This aspect of interestingness is sometimes referred to as the 
reliability of a rule [11]. 

In many association rule filtering approaches, the measure of significance or 
reliability has been largely ad-hoc, stemming from Boolean logical theory rather 
than statistical theory. Boolean approaches typically combine the support and 
confidence of a rule in a dual-ordering approach, trying to maximize both support 
and confidence, with an implied maximization of reliability [5]. Related logical 
mechanisms for removing redundant association rules use the concept of closed 
itemsets [8,21]. Closure-based methods are most effective in noise-free problems. 
In these problems, large sets of records contain exactly the same associations, 
allowing the pruning of redundant subsets of the items without loss of infor- 
mation. In noisier data, fewer rules can be considered redundant according to 
closure properties, limiting their effectiveness. For such noisy datasets, statistical
techniques have been used by other authors, focusing either on pruning asso- 
ciation rules [14,19] or identifying correlated attributes involved in association 
rules [12,17]. 

3 Problem Statement 

In our approach to association rule filtering, we will follow the approach of Liu et 
al. [14] and use a statistical model to evaluate the reliability of association rules, 
with a focus on association rules with a single value/item in the consequent of 
the rule. Furthermore, we pose our problem as one of association rule mining 
over relational tables, rather than itemsets. 

In mining over relational tables, the input consists of a table $T$ with a set
of $N$ records, $R = \{r_1, \ldots, r_N\}$, and $n$ attributes, $A = \{a_1, \ldots, a_n\}$; all of the
attributes take on a discrete set of values, $Dom(a_i)$. We assume the table contains
no missing values.

The output of an association mining exercise is a set of patterns, where a
pattern $X$ associates each attribute in a subset of $A$ with a particular value,

\[ X = \{ \langle a_{x_1}, v_1 \rangle, \langle a_{x_2}, v_2 \rangle, \ldots, \langle a_{x_k}, v_k \rangle \} \]

such that $v_i \in Dom(a_{x_i})$. A pattern that contains $k$ attribute/value pairs is
called a $k$-th order pattern. A record $r_i$ of the table matches or instantiates a
pattern $X$ if, for each attribute $a_j$ in $X$, $r_i$ contains the value $v_j$ for $a_j$.

Each of the patterns (also called itemsets) has a number of descriptive
parameters. We define the support of a pattern, $Sup(X|T)$, as the cardinality of
the set $X_T$, denoted $|X_T|$, such that $X_T = \{ r_i \in T \mid r_i \text{ is an instance of pattern } X \}$.
Wherever the table $T$ is the original input table, we simply write $Sup(X)$.
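The definitions of matching and support translate directly into code. The sketch below assumes, purely for illustration, that records and patterns are represented as Python dictionaries mapping attribute names to values; the toy table and attribute names are hypothetical.

```python
def matches(record, pattern):
    """True if the record holds the required value for every attribute in the pattern."""
    return all(record[attr] == value for attr, value in pattern.items())

def support(pattern, table):
    """Sup(X|T): the number of records in the table that instantiate the pattern."""
    return sum(1 for record in table if matches(record, pattern))

# Hypothetical toy table T with three discrete attributes and no missing values.
T = [
    {"color": "red", "size": "big", "buy": "yes"},
    {"color": "red", "size": "small", "buy": "yes"},
    {"color": "blue", "size": "big", "buy": "no"},
    {"color": "red", "size": "big", "buy": "no"},
]
X = {"color": "red", "size": "big"}  # a second-order pattern
```

Here `support(X, T)` counts the first and fourth records, and the first-order pattern `{"color": "red"}` is matched by three records.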

Often patterns are combined to make a larger pattern, through the composition
of two smaller patterns, $X$ and $Y$, denoted $X \circ Y$. The composition of two
patterns produces a longer pattern including the attribute/value pairs of both
input patterns:

\[ X \circ Y = \{ \langle a_{x_1}, v_{x_1} \rangle, \ldots, \langle a_{x_{k_1}}, v_{x_{k_1}} \rangle, \langle a_{y_1}, v_{y_1} \rangle, \ldots, \langle a_{y_{k_2}}, v_{y_{k_2}} \rangle \} \]

To avoid degeneracies, we define composition only for patterns with
non-overlapping attributes. Composition of patterns of order $k_1$ and $k_2$ results in
a pattern of order $k_1 + k_2$.

An association rule is a pairing of two patterns, $X \rightarrow Z$, and is
interpreted as a causal or correlational relationship. The support and confidence
of an association rule, denoted $Sup(X \rightarrow Z)$ and $Conf(X \rightarrow Z)$, are defined in
terms of support for the patterns and their compositions:

\[ Sup(X \rightarrow Z) = Sup(X \circ Z), \qquad Conf(X \rightarrow Z) = \frac{Sup(X \circ Z)}{Sup(X)} \]
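Composition, rule support, and rule confidence can be sketched as below. Patterns are again assumed to be dictionaries, and confidence is taken to be the usual ratio of the support of the composed pattern to the support of the antecedent; both choices are assumptions of this example, and the toy table is hypothetical.

```python
def support(pattern, table):
    """Number of records matching every attribute/value pair of the pattern."""
    return sum(all(r[a] == v for a, v in pattern.items()) for r in table)

def compose(x, y):
    """Composition of two patterns, defined only for non-overlapping attributes."""
    if set(x) & set(y):
        raise ValueError("patterns share attributes")
    return {**x, **y}

def rule_support(x, z, table):
    """Support of the rule x -> z: the support of the composed pattern."""
    return support(compose(x, z), table)

def rule_confidence(x, z, table):
    """Confidence of x -> z; assumes the antecedent x has non-zero support."""
    return rule_support(x, z, table) / support(x, table)

# Hypothetical toy table.
T = [
    {"color": "red", "size": "big", "buy": "yes"},
    {"color": "red", "size": "small", "buy": "yes"},
    {"color": "blue", "size": "big", "buy": "no"},
    {"color": "red", "size": "big", "buy": "no"},
]
X, Z = {"color": "red"}, {"buy": "yes"}
```

For this table the rule `X -> Z` has support 2 and confidence 2/3, and composing a second-order pattern with a first-order pattern yields a third-order pattern.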

4 Reliable Association Rules 

In an association mining study, the set of patterns generated can be
straightforwardly turned into a set of association rules [2]. The output of association rule
generation will often be a long list of rules, $X_i \rightarrow Z_i$, with the number of rules
determined by the table $T$, the minimum support required for any rule, $minsup$,
and possibly a minimum confidence constraint, $minconf$. The two questions we
want to address in our filtering, for any particular rule $X \rightarrow Z$, are

1. Is the rule $X \rightarrow Z$ reliable (statistically significant), given $Sup(X)$, $Sup(Z)$,
and $Sup(X \circ Z)$?

2. Is the rule $X \rightarrow Z$ an unreliable extension of a lower-order rule, $Y \rightarrow Z$,
where $X = Y \circ Q$, for some non-predictive pattern $Q$?

We approach both questions using a statistical sampling model. Imagine an
urn filled with $N$ balls, each ball representing one record of the input table $T$. For
each record, its ball is colored red if it matches the consequent pattern $Z$, and
black if it does not. A sample is then taken from the urn of all those balls matching
the predictive or antecedent pattern $X$. We then ask ourselves the question,
“is the distribution of red and black balls matching the pattern $X$ significantly
different from what we would expect from a random scoop of $Sup(X)$ balls from
the urn?” This question can be answered using the hypergeometric probability
distribution, which exactly represents the probability of this
sampling-without-replacement model [16].

Specifically, if we have the four values $Sup(X)$, $Sup(Z)$, $Sup(\bar{Z})$, and
$Sup(X \circ Z)$ (with shorthand $S_X = Sup(X)$, $S_{X \circ Z} = Sup(X \circ Z)$, etc. for
conciseness, and with $\bar{Z}$ standing for the records that do not match $Z$), the
hypergeometric probability distribution, $H$, is given by:

\[ H(n = S_{X \circ Z} \mid S_Z, S_{\bar{Z}}, S_X) = \frac{C(S_Z, n)\, C(S_{\bar{Z}}, S_X - n)}{C(S_Z + S_{\bar{Z}}, S_X)} \tag{1} \]

where $C(N, x) = N! / (x!\,(N - x)!)$ is the number of ways to choose $x$ records from
a collection of $N$ records.
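Equation 1 can be evaluated with exact integer binomial coefficients. The sketch below uses Python's `math.comb`; the function and argument names are choices made for this example, with `s_not_z` standing for the number of records that do not match the consequent.

```python
from math import comb

def hypergeom_pmf(n, s_z, s_not_z, s_x):
    """Probability of exactly n red balls in a draw of s_x balls without replacement,
    from an urn holding s_z red and s_not_z black balls (Equation 1)."""
    if n < 0 or n > s_x:
        return 0.0
    # comb(a, b) is 0 when b > a, so impossible outcomes get probability 0.
    return comb(s_z, n) * comb(s_not_z, s_x - n) / comb(s_z + s_not_z, s_x)

# Small worked case: 4 red and 6 black balls, draw 3, ask for exactly 2 red.
# C(4,2) * C(6,1) / C(10,3) = 6 * 6 / 120 = 0.3
p_two_red = hypergeom_pmf(2, 4, 6, 3)
```

Because the coefficients are computed as exact integers before the final division, the result does not rely on a large-sample approximation.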

The function $H$ gives the probability of exactly $S_{X \circ Z}$ red balls being selected.
To compute a significance value, we need to calculate the cumulative probability,
or the probability of an outcome as extreme as $S_{X \circ Z}$, which is either
$P(n \geq S_{X \circ Z})$ or $P(n \leq S_{X \circ Z})$, through the following sums (where $n$ is the
number of red balls):

\[ P(n \geq S_{X \circ Z} \mid S_Z, S_{\bar{Z}}, S_X) = \sum_{k = S_{X \circ Z}}^{S_X} H(k \mid S_Z, S_{\bar{Z}}, S_X) \tag{2} \]

\[ P(n \leq S_{X \circ Z} \mid S_Z, S_{\bar{Z}}, S_X) = \sum_{k = 0}^{S_{X \circ Z}} H(k \mid S_Z, S_{\bar{Z}}, S_X) \tag{3} \]

These formulae produce the p-values of the rule under the null hypothesis of the
hypergeometric sampling model. A low p-value in Equation 2 indicates a
higher-than-random confidence for the rule $X \rightarrow Z$, while a low p-value in Equation 3
indicates an interesting low confidence rule.
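The two tail sums can be sketched as follows. The numeric example revisits the earlier case of a rule that is 100% accurate but covers only 2 customers; the database size of 1,000 and the 300 customers who take the action are hypothetical figures chosen for illustration.

```python
from math import comb

def hypergeom_pmf(k, s_z, s_not_z, s_x):
    """Equation 1: probability of exactly k red balls in a draw of s_x."""
    if k < 0 or k > s_x:
        return 0.0
    return comb(s_z, k) * comb(s_not_z, s_x - k) / comb(s_z + s_not_z, s_x)

def upper_tail(s_xz, s_z, s_not_z, s_x):
    """Equation 2: probability of at least s_xz red balls."""
    return sum(hypergeom_pmf(k, s_z, s_not_z, s_x) for k in range(s_xz, s_x + 1))

def lower_tail(s_xz, s_z, s_not_z, s_x):
    """Equation 3: probability of at most s_xz red balls."""
    return sum(hypergeom_pmf(k, s_z, s_not_z, s_x) for k in range(0, s_xz + 1))

# Hypothetical numbers: 1,000 customers, 300 of whom take the action (S_Z = 300),
# and a rule covering 2 customers, both of whom take it (S_X = 2, S_XoZ = 2).
p_high = upper_tail(2, 300, 700, 2)   # (300 * 299) / (1000 * 999), about 0.09
p_low = lower_tail(2, 300, 700, 2)    # every outcome has at most 2 red balls
```

With an upper-tail p-value near 0.09, the perfectly accurate two-customer rule would not pass even a lenient cutoff such as 0.01.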

It should be noted that other authors have used p-value ranking approaches,
using other statistical measures. The $\chi^2$ measure has been the most popular, but
studies have also used the gini index, correlation coefficient, and interest factor
(see [19] for a comprehensive presentation of many of these measures).
Unfortunately, all of these measures implicitly rely on large sample assumptions, and so
their p-value estimates become less reliable as the patterns’ support decreases.
The hypergeometric distribution, because of its explicit counting approach, is an
exact method, applicable to patterns at all support values. We believe that the
hypergeometric is a more appropriate null hypothesis distribution when dealing
with association rules that include patterns with low support (less than 50
records). In the rest of the paper, we use the hypergeometric distribution
exclusively, keeping in mind that other distributions could be used.



5 Filtering Algorithms 



The hypergeometric probability distribution, or any other appropriate null
hypothesis distribution such as $\chi^2$, can be used to test the significance of association
rules in two ways. The simplest way is to evaluate the reliability of every
association rule discovered, $X \rightarrow Z$, by computing its p-value (using Equations 2
and 3), with the p-values being computed relative to the baseline distribution of
the consequent pattern, $Z$. In particular, for any rule $X_i \rightarrow Z_i$ found during
association mining, we can define the two-tailed p-value of the rule, $p_i$, as follows:

\[ p_i = \min \left( P(n \geq S_{X_i \circ Z_i} \mid S_{Z_i}, S_{\bar{Z}_i}, S_{X_i}),\; P(n \leq S_{X_i \circ Z_i} \mid S_{Z_i}, S_{\bar{Z}_i}, S_{X_i}) \right) \]

If $p_i$ is less than some user-defined p-value threshold $p_{max}$, we consider the rule
$X_i \rightarrow Z_i$ to be interesting.

This p-value ranking is simple, and is computationally linear in the number
of association rules found. It provides a straightforward statistical approach to
rank and filter association rules. The disadvantages are that many similar rules
will be evaluated, and, if a particular rule is significant, e.g. $X \rightarrow Z$, then adding
spurious extensions to the antecedent is also likely to produce a significant rule,
e.g. $(X \circ Y) \rightarrow Z$, even if $Y$ is a pattern which has no predictive relationship with
$Z$.
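The rule-by-rule filter amounts to one pass over the rules. In the sketch below each rule is described only by its three supports; the tuple layout, the rule names, and the example counts are hypothetical, and the p-value is the minimum of the two hypergeometric tails as defined above.

```python
from math import comb

def tail_p_values(s_xz, s_z, s_not_z, s_x):
    """Return (upper tail, lower tail) of the hypergeometric distribution at s_xz."""
    total = comb(s_z + s_not_z, s_x)
    pmf = [comb(s_z, k) * comb(s_not_z, s_x - k) / total for k in range(s_x + 1)]
    return sum(pmf[s_xz:]), sum(pmf[: s_xz + 1])

def rule_p_value(s_xz, s_z, s_not_z, s_x):
    """p_i: the smaller of the two tail probabilities."""
    return min(tail_p_values(s_xz, s_z, s_not_z, s_x))

def filter_rules(rules, n_records, p_max):
    """Keep rules with p-value below p_max. Each rule is (name, s_x, s_z, s_xz)."""
    kept = []
    for name, s_x, s_z, s_xz in rules:
        p = rule_p_value(s_xz, s_z, n_records - s_z, s_x)
        if p < p_max:
            kept.append((name, p))
    return kept

# Hypothetical rules over 1,000 records, half of which match the consequent.
rules = [
    ("A -> Z", 50, 500, 45),  # 45 of 50 covered records match Z
    ("B -> Z", 50, 500, 26),  # 26 of 50: close to the baseline rate
]
interesting = filter_rules(rules, n_records=1000, p_max=0.001)
```

Each rule is evaluated independently, which is what makes the cost linear in the number of rules and the work easy to split into batches.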

To overcome the problems of spurious rule extensions, a statistical model 
may be used in an incremental way to gradually prune away non-interesting parts
of an association rule, leaving only the interesting part. This process is laid out 
in Algorithm 1. 

Algorithm 1 Incremental Probabilistic Pruning

Input: A table ($T$), a minimum support ($minsup$), and a p-value cutoff ($p_{max}$).
Output: a set of association rules, $R_{out}$.

1. Create an empty set of association rules, $R_{out} = \emptyset$.

2. Construct a set of association rules, $I = \{X_i \rightarrow Z_i\}$, from the set of records
in $T$, constraining the search with the minimum support, $minsup$.

3. For each association rule $X_i \rightarrow Z_i$ (called $X \rightarrow Z$ for clarity in this step),

(a) Refer to $T$ to find the supports related to the association rule, $S_{X \circ Z}$.

(b) Compute the two-tailed p-value of the association rule given every possible
sub-rule $Y_j \rightarrow Z$, where $X = Y_j \circ Q_j$, where $Q_j$ is a first-order pattern,
and $Y_j$ is one order less than the pattern $X$. Take the maximum of the
p-values over the possible patterns $Y_j$:

\[ p_{ij} = \min \left( P(n \geq S_{X \circ Z} \mid S_{Y_j \circ Z}, S_{Y_j \circ \bar{Z}}, S_X),\; P(n \leq S_{X \circ Z} \mid S_{Y_j \circ Z}, S_{Y_j \circ \bar{Z}}, S_X) \right) \]

\[ p_i = \max_j \, p_{ij} \]

using Equations 2 and 3 to compute the hypergeometric probabilities.

(c) If $p_i < p_{max}$, then none of the sub-rules $Y_{ij} \rightarrow Z_i$ can explain the rule
$X_i \rightarrow Z_i$, so add the rule $X_i \rightarrow Z_i$ to the collection $R_{out}$.

(d) If $p_i \geq p_{max}$, find the sub-rule $Y_{ij} \rightarrow Z_i$ for which the $p_{ij}$ was maximal,
and return to step 3, with $Y_{ij} \rightarrow Z_i$ instead of $X_i \rightarrow Z_i$.
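The steps above can be sketched in code. This is a minimal illustration under several assumptions that are not stated in the paper: patterns are dictionaries, the rule set of step 2 is supplied by the caller rather than mined, a first-order antecedent is tested against the empty pattern (which matches every record), and a rule whose antecedent shrinks to nothing is dropped. The function names and the toy table are hypothetical.

```python
from math import comb

def support(pattern, table):
    return sum(all(r[a] == v for a, v in pattern.items()) for r in table)

def two_tailed_p(s_xz, s_red, s_black, s_x):
    """min of P(n >= s_xz) and P(n <= s_xz) when s_x balls are drawn from the urn."""
    total = comb(s_red + s_black, s_x)
    pmf = [comb(s_red, k) * comb(s_black, s_x - k) / total for k in range(s_x + 1)]
    return min(sum(pmf[s_xz:]), sum(pmf[: s_xz + 1]))

def prune_rule(x, z, table, p_max):
    """Shrink antecedent x until no one-item-smaller sub-rule explains x -> z."""
    while True:
        s_x = support(x, table)
        s_xz = support({**x, **z}, table)
        worst_p, worst_y = -1.0, None
        for attr in x:  # remove one first-order pattern Q_j at a time
            y = {a: v for a, v in x.items() if a != attr}
            s_y = support(y, table)          # records matching the sub-pattern Y_j
            s_yz = support({**y, **z}, table)
            p = two_tailed_p(s_xz, s_yz, s_y - s_yz, s_x)
            if p > worst_p:
                worst_p, worst_y = p, y
        if worst_p < p_max:
            return x        # step 3(c): no sub-rule explains the rule, keep it
        if not worst_y:
            return None     # assumption: nothing significant is left
        x = worst_y         # step 3(d): continue with the best-explaining sub-rule

def incremental_pruning(rules, table, p_max):
    r_out = []
    for x, z in rules:
        kept = prune_rule(dict(x), z, table, p_max)
        if kept is not None and (kept, z) not in r_out:
            r_out.append((kept, z))
    return r_out

# Hypothetical table: attribute a predicts out, attribute b is unrelated noise.
table = [
    {"a": i % 2, "b": (i // 2) % 2, "out": 1 if (i % 2 == 1 or i % 10 == 0) else 0}
    for i in range(200)
]
z = {"out": 1}
pruned = incremental_pruning([({"a": 1, "b": 1}, z), ({"b": 1}, z)], table, 1e-4)
```

On this toy table the spurious extension `b = 1` is stripped from the first rule, leaving only the rule with antecedent `a = 1`, while the rule built on `b` alone is discarded.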

6 Implications of Pruning Algorithm 
for Association Mining 

The idea of using subsuming patterns to define the relative interestingness of
association rules was proposed earlier by Liu et al. [14], using the $\chi^2$ distribution
rather than hypergeometric to measure rule significance. They performed their
pruning following the incremental Apriori approach [1]. This involved finding and
pruning all first-order rules, then second-order rules, etc. In dense data tables,
as found by Liu et al., there are many association rules with relatively high
support (above 20%), but few rules that pass the significance pruning as outlined
in Algorithm 1. However, since the rules with high support are most likely to
be those already known in the domain, it behooves us to search for interesting
rules of lower support, which is hopelessly infeasible using a pure Apriori-based
approach.

Fortunately, other association mining algorithms exist which do not require 
the explicit construction of every possible sub-pattern for an association rule. 
There are several algorithms that efficiently search for maximal frequent
patterns [13,9]. A frequent pattern $X$ is maximal if it has no superset that is frequent:
$Sup(X) \geq minsup$, and $Sup(X \circ Y) < minsup$ for any pattern $Y$. Two examples
of algorithms that identify maximal itemsets are Max-Miner [13] and GenMax 
[9]. There are also heuristic association mining algorithms, such as SLAM [18] 
which do not use the recursive search strategy of the previous algorithms. Our 
hypothesis is that using the incremental pruning procedure in Algorithm 1, we 
should identify roughly the same set of interesting rules whether we start
from a relatively small set of maximal frequent patterns (e.g. GenMax), the com- 
plete set of frequent patterns (e.g. Apriori), or a heuristically-generated subset 
of patterns (e.g. SLAM). 
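The maximality condition can be checked directly for a given pattern. The sketch below is not how Max-Miner or GenMax work; it is a brute-force test, assuming dictionary patterns and a hypothetical `domains` mapping from each attribute to its possible values. Only one-item extensions need to be tried, because any frequent superset would contain a frequent one-item extension.

```python
def support(pattern, table):
    return sum(all(r[a] == v for a, v in pattern.items()) for r in table)

def is_maximal(x, table, domains, minsup):
    """True if x is frequent and no one-item extension of x is frequent."""
    if support(x, table) < minsup:
        return False
    for attr, values in domains.items():
        if attr in x:
            continue
        for value in values:
            if support({**x, attr: value}, table) >= minsup:
                return False
    return True

# Hypothetical toy table with two binary attributes.
table = [{"a": 1, "b": 1}, {"a": 1, "b": 1}, {"a": 1, "b": 0}, {"a": 0, "b": 1}]
domains = {"a": [0, 1], "b": [0, 1]}
```

With `minsup = 2`, the pattern `{"a": 1}` is frequent but not maximal, since adding `b = 1` still leaves two matching records, whereas `{"a": 1, "b": 1}` is maximal.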

7 Applications 

There were two particular questions we wanted to answer in evaluating our asso- 
ciation rule pruning. The first was what percentage of rules identified on different 
datasets could safely be pruned away using Algorithm 1. The second question
was how effective incomplete searches for frequent patterns followed by pruning 
would be compared to more exhaustive searches followed by pruning. Incomplete 
association mining algorithms can be orders of magnitude faster than exhaustive 
algorithms, and produce orders of magnitude fewer associations. If the pruning 
leads to roughly the same results using these different association mining tools, 
then there is a considerable computational advantage to be gained by pruning 
only the rules generated by the incomplete association mining algorithms. 

In all the experiments, we computed p-values using the hypergeometric
probability calculation described in [20]. We found the algorithm sufficiently fast for
the number of records encountered, and it provided exact estimates of the
probabilities for rules with small support. If faster computations are necessary, the
hypergeometric distribution can be approximated by an appropriate binomial,
normal or $\chi^2$ distribution.
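To give a feel for the binomial approximation, the sketch below compares the exact upper tail with the tail of a binomial that treats each draw as independent with success probability $S_Z / N$. The function names and the example counts are assumptions for illustration; this is not the calculation from [20].

```python
from math import comb

def hypergeom_upper(s_xz, s_z, s_not_z, s_x):
    """Exact P(n >= s_xz) when drawing s_x records without replacement."""
    total = comb(s_z + s_not_z, s_x)
    hits = sum(comb(s_z, k) * comb(s_not_z, s_x - k) for k in range(s_xz, s_x + 1))
    return hits / total

def binomial_upper(s_xz, s_z, s_not_z, s_x):
    """Approximation: draws with replacement, each red with probability s_z / N."""
    p = s_z / (s_z + s_not_z)
    return sum(comb(s_x, k) * p ** k * (1 - p) ** (s_x - k) for k in range(s_xz, s_x + 1))

# Hypothetical rule: 1,000 records, 500 match Z, the antecedent covers 20 records,
# 16 of which match Z.
exact = hypergeom_upper(16, 500, 500, 20)
approx = binomial_upper(16, 500, 500, 20)
```

When the antecedent covers a small fraction of the table, as here, the two values are close; the approximation degrades as $S_X$ approaches $N$, because sampling without replacement then differs more from independent draws.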



7.1 Description of Datasets 

To evaluate the pruning technique, one simulated collection and two real-world 
datasets were considered. 

The simulated datasets were designed to test the sensitivity and specificity
of the pruning approach in Algorithm 1. Tables were created with 51
binary-valued attributes (values “0” and “1”) and 1,000 records. The first 50 attributes
were treated as the predictor attributes, and the 51st attribute was the outcome
attribute, or the consequent attribute for association rules. The goal was to find
patterns predictive of the positive value, “1”, of this 51st attribute. The 50
predictor attributes were generated independently, using a uniform probability
for each binary value. With this distribution, all predictive patterns of size $n$
have an expected support of $1{,}000/2^n$.

The distribution for the outcome attribute was defined using two parameters,
$n_p$ and $o_p$, where $n_p$ specified the size of the predictive pattern, and $o_p$ the relative
odds of a positive outcome if a record was an instance of the predictive pattern.
For ease of identification, a pattern of size $n_p$ was considered to be the pattern of
$n_p$ 1’s on the first $n_p$ attributes. For example, if we chose $n_p = 3$ and $o_p = 4$, then
records being an instance of the pattern “1,1,1” as the values for the first three
attributes would have a 4 times greater chance of having a 1 in the outcome than
the other records. Records not matching the predictive pattern had a probability
of 1/2 of being an outcome of 1.

The rationale behind this simulation is that it generates dense datasets with
large numbers of association rules, of which only a single rule is actually
predictive. By controlling the size of the predictive rule by $n_p$, we are effectively
setting the support to $1000/2^{n_p}$, and by setting the odds we can control the
interestingness of the rule.
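A simulator of this kind could look like the sketch below. One point is an interpretation rather than something the text settles: "relative odds" is read here as an odds ratio against the baseline odds of 1 (probability 1/2), so matching records receive a positive outcome with probability $o_p / (1 + o_p)$. The function name, the seed handling, and the list-of-lists table layout are likewise assumptions.

```python
import random

def simulate_table(n_p, o_p, n_records=1000, n_predictors=50, seed=0):
    """Generate binary predictor columns plus one outcome column per record."""
    rng = random.Random(seed)
    # Assumption: o_p is an odds ratio relative to baseline odds of 1,
    # so matching records have outcome probability o_p / (1 + o_p).
    p_match = o_p / (1.0 + o_p)
    table = []
    for _ in range(n_records):
        row = [rng.randint(0, 1) for _ in range(n_predictors)]
        is_instance = all(v == 1 for v in row[:n_p])  # n_p ones on the first n_p attributes
        p_one = p_match if is_instance else 0.5
        row.append(1 if rng.random() < p_one else 0)
        table.append(row)
    return table

table = simulate_table(n_p=3, o_p=4)
n_instances = sum(1 for row in table if all(v == 1 for v in row[:3]))
```

With $n_p = 3$, the number of records matching the seeded pattern should fall near the expected support of $1000/2^3 = 125$.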

The real datasets we studied have been used previously in evaluations of
algorithms for maximal frequent pattern identification. For clarity, we focused our
attention on the chess and mushroom datasets (available from the UCI
Machine Learning Repository [4]). The chess dataset contains 3,196 records with 23
attributes, while the mushroom dataset contains 8,124 records of 23 attributes.

7.2 Association Mining Algorithms 

For each of the datasets, we performed searches for patterns associated with an 
outcome of interest. In the simulated datasets, we searched for patterns with a 
‘1’ in the fifty-first attribute; for the chess dataset, we searched for patterns that 
contained the outcome “won”, and for the mushroom dataset, we searched for 
patterns that contained the outcome “edible”.

We performed the search for these predictive patterns using four different 
association mining algorithms. Our goal was to decide whether the pruning tech- 
nique required the explicit listing of all associations or whether more efficient but 
less exhaustive association mining searches could provide the same results after 
pruning. The four algorithms we used were Apriori [2], FPMine* [10], SLAM 
[18], and MAX. 

Apriori was used in its normal fashion to find a complete set of patterns, 
given a particular minimum support constraint. We modified the FPMine algo- 
rithm of [10] slightly so it would not output every possible subset of patterns 
found when it had uncovered a path-tree in its recursion step; we called our 
variation FPMine*. With this change, FPMine* outputs fewer associations at 
a given minimum support than Apriori, but the set is still generally large and 
will include the maximal frequent patterns. SLAM is a heuristic association
mining algorithm, which tends to generate fewer associations than Apriori and
FPMine*, but can use a lower minsup. MAX was written as a simple recursive
search for maximal patterns (see [13] and [9] for more efficient algorithms for
mining maximal patterns).

Fig. 1. Log-log plot of the number of association rules reported at different p-value
cutoffs for Algorithm 1, applied to the simulated data. Different lines indicate different
parameter settings, $n_p$ and $o_p$, of the simulator for data generation.



8 Results 

We present the results for the filtering of association rules below, discussing first 
the results using simulated datasets and then using the real datasets. 



8.1 Simulated Datasets 

For identifying rules on the simulated datasets, we only used the Apriori algo- 
rithm, with a minimum support cutoff of minsup = 40 records out of 1,000. 
With this support, Apriori generated between 500,000 and 1,000,000 association 
rules, of which only a single rule was actually predictive. Since Algorithm 1 takes 
a p-value cutoff as a parameter, we first studied the number of unique association 
rules left using different p-value cutoffs. The graph in Figure 1 shows the steep 
decline in the number of interesting rules as the p-value threshold is lowered¹.

From the graph in Figure 1, it is clear that the pruning dramatically reduced 
the number of patterns reported, so few non-predictive patterns are reported
when the p-value threshold is strict enough (that is, the cutoff is low enough).
Our next concern was the sensitivity of the test: how frequently was the real
pattern reported after pruning?
We considered pattern sizes ($n_p$) from {1, 2, 3, 4}, and predictive odds ($o_p$) from
{1.5, 2, 3, 4}. For each possible pair ($n_p$, $o_p$), we generated ten tables using the
simulator distribution, performed a search for frequent patterns ($minsup$ of 40),
and pruned the resulting set using Algorithm 1.

¹ It should be noted that the pruning rule is performing multiple statistical tests for
each of the many association rules, and so a “typical” p-value cutoff like 0.01 is not
appropriate: lower (more significant) values should be used to obtain more realistic
significance estimates. A discussion of multiple testing is presented in [3].

Fig. 2. Success at identifying the seeded pattern for various p-value cutoffs. For the
simulated data, using different p-values, the seeded motif is reported less often if it is
relatively infrequent (larger motif sizes) or less predictive (lower odds).

In Figure 2, we see the percentage of the runs for which the correct motif was 
found after pruning, for different p-values of the pruning algorithm. As expected,
when the motifs were shorter and had higher odds, they were more significant
and more consistently reported. Motifs with odds of 1.5 were only reported
occasionally when the motif size was short (size 2), and never otherwise, even
for the least restrictive p-value of 0.01. This is unsurprising, given the low odds and
relatively low prevalence of the rule; the occasional significant finding is the 
result of sampling variation. 

Motifs of size 3 or 4 with odds of 1.5 have too low a support to ever be 
significant even at p = 0.01 levels. 

The results of the association mining on simulated data indicate that the 
approach is effective at removing the vast majority of statistically non-interesting 
patterns, while still identifying real motifs if they achieve the specified statistical 
confidence level. We now consider the application of the pruning algorithm in 
the context of real data. 

8.2 Real Datasets 

For the real datasets, chess and mushroom, we used all four association mining 
algorithms, Apriori, FPMine*, SLAM and MAX. For Apriori and FPMine*, we 
experimented to find minimum support values that produced roughly 2 million 
patterns associated with the outcome of interest (winning the chess game, edible 
mushrooms). Since FPMine* does not report all patterns above the minimum 
support threshold, it could use a lower minimum support, and identify less
frequent patterns. We ran MAX at various levels of support for which run times
were reasonable (less than an hour). SLAM, being an iterative procedure, can
be run for any number of iterations, regardless of the minimum support. We
selected a run-time of 10 minutes on each dataset to search for patterns (similar to
Apriori and FPMine*). For ease of comparison, SLAM used a minimum support
equal to that used with MAX.

Table 1. Number of patterns found before and after pruning for different association
mining algorithms, using a pruning p-value of 0.0001.

Dataset    Algorithm   minsup   Patterns   Pruned Patterns
Chess      Apriori        860    2195736   1467
Chess      FPMine*        760    2078291   2186
Chess      SLAM           300     243782   3541
Chess      MAX            300      20356    568
Mushroom   Apriori        280    2334446   3182
Mushroom   FPMine*         70    2337408   5122
Mushroom   SLAM            40      26950   1563
Mushroom   MAX             40       4859    231

After each association mining algorithm was used on both datasets, the pat- 
terns found were pruned using Algorithm 1. The number of patterns before and 
after the pruning are given in Table 1. 

In Table 1, we see the clear monotonicity in the minimum supports used 
for the different algorithms. Furthermore, the incomplete searches (SLAM and 
MAX) generate orders of magnitude fewer associations than the more complete 
searches (Apriori and FPMine*). When different p-value cutoffs were used in the
pruning, the relative number of pruned motifs for each algorithm stayed roughly 
constant (not shown). 

We also studied the relative overlap of the patterns reported by each method. 
We used the patterns generated by Apriori as the baseline for comparison. In 
general, using the FPMine* algorithm to find patterns with lower support re- 
sulted in a duplication of the patterns found by Apriori, with the addition of 
new significant patterns with lower support. For the chess dataset, all of the 
pruned patterns from Apriori were found in the pruned FPMine* patterns, with 
FPMine* having an additional 719 novel patterns. For the mushroom dataset, 
there were some patterns found only in the pruned Apriori results (169 patterns),
while FPMine* reported an additional 2109 patterns beyond the set found by
Apriori. This indicated that overall, if more significant patterns were desired,
then using an approach like FPMine*, with a relatively low minimum support, 
would be more rewarding than performing a complete search with higher support 
using Apriori. 

The patterns found while pruning the SLAM and MAX results were quite 
different from those found by the more complete search methods. In the chess 
dataset, of MAX’s pruned patterns, roughly two-thirds were not present in the
patterns from Apriori, but 1123 patterns from Apriori were not found by MAX.
In this situation, there is an argument for using several search algorithms to
maximize the likelihood of finding a comprehensive set of interesting patterns.

As we hoped, the heuristic association mining algorithm SLAM provided a 
complementary approach between the extremes of discovering all patterns and 
only the maximal patterns. It generated far fewer patterns than the complete 
methods, of which a much higher proportion pruned down to unique significant 
patterns. 
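
The proportions behind this observation can be recomputed from the counts in
Table 1. The short Python sketch below is only an illustration: the dictionary
layout and the function name retained_fraction are our own choices, not part of
the method. For each dataset and algorithm it reports the fraction of raw
patterns that survive pruning.

```python
# Counts copied from Table 1: (raw patterns, patterns left after pruning).
TABLE1 = {
    "Chess": {
        "Apriori": (2195736, 1467),
        "FPMine*": (2078291, 2186),
        "SLAM": (243782, 3541),
        "MAX": (20356, 568),
    },
    "Mushroom": {
        "Apriori": (2334446, 3182),
        "FPMine*": (2337408, 5122),
        "SLAM": (26950, 1563),
        "MAX": (4859, 231),
    },
}


def retained_fraction(dataset, algorithm):
    """Fraction of the raw patterns that remain after pruning."""
    raw, pruned = TABLE1[dataset][algorithm]
    return pruned / raw


for dataset, rows in TABLE1.items():
    for algorithm in rows:
        share = retained_fraction(dataset, algorithm)
        print(f"{dataset:9s} {algorithm:8s} {share:.3%}")
```

On the chess data, for instance, SLAM keeps roughly 1.5% of its raw patterns,
against roughly 0.07% for Apriori and 0.1% for FPMine*.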

9 Discussion 

We have shown that the proposed filtering algorithm can be used to reliably 
detect truly predictive patterns in data tables with many spurious associations. 
We have further shown that the algorithm can be used to complement a variety 
of association mining algorithms, from complete searches to maximal association
mining. Regardless of the search algorithm, the number of significant
associations is always far smaller than the raw set returned, indicating the usefulness
of statistical pruning as a post-processing step for any association mining. Fur- 
thermore, by selecting different p-value cutoffs for the pruning, the size of the 
final set of patterns can be easily controlled. 

There are several desirable features about this filtering algorithm. Firstly, 
the filtering can be applied to every predictive pattern independently of the 
others, meaning that batch parallelism could be used to simultaneously prune a 
large number of patterns. The parallel overhead would be the recombination of the
relatively small number of pruned patterns into a single set of unique patterns. 

Also, the proposed pruning method can be used to evaluate the statistical 
reliability of rules against a more general probability model. In our examples, 
we evaluated the support of various patterns by counting records in the dataset. 
However, a more general probability model could be used to compute the ex- 
pected support levels, such as the maximum entropy model described in [15]. 

10 Conclusion 

We have presented a statistical approach for filtering and pruning predictive 
patterns identified through association mining. We have further shown that this 
filtering can be used with a variety of association mining algorithms, allow- 
ing a progressive filtering of a large collection of predictive patterns down to 
a relatively small set of significant patterns. This pruning can be performed in 
isolation, or as a pre-processing step for more computationally expensive pattern 
filtering algorithms. 
Abstract. Associative classification is a well-known technique for structured
data classification. Most previous work on associative classification based the
assignment of the class label on a single classification rule. In this work we
propose the assignment of the class label based on simple majority voting among
a group of rules matching the test case.

We propose a new algorithm, L³G, which is based on the previously proposed
algorithm L³. L³ performed a reduced amount of pruning, coupled with a two-step
classification process. L³G combines this approach with the use of multiple
rules for data classification. The use of multiple rules, both during database
coverage and classification, yields an improved accuracy.



1 Introduction 

Association rules [1] describe the co-occurrence among data items in a large
amount of collected data. Recently, association rules have also been considered
a valuable tool for classification purposes. Classification rule mining is the
discovery of a rule set in the training database to form a model of the data, the
classifier. The classifier is then used to classify new data for which the class
label is unknown [12]. Unlike decision trees, association rules consider the
simultaneous correspondence of values of different attributes, and can hence
achieve better accuracy [2,4,8,9,14].

Most recent approaches to associative classification (e.g., CAEP [4], CBA
[9], ADT [14], and L³ [2]) use a single classification rule to assign the class label
to new data whose label is unknown. A different approach, based on the use of
multiple association rules to perform classification of new data, has been proposed
in CMAR [8], where it has been shown that this technique yields an increase
in the accuracy of the classifier. We believe that this technique can be applied
orthogonally to almost any type of classifier. Hence, in this paper we propose
L³G, a new algorithm which incorporates multiple rule classification into L³, a
levelwise classifier previously proposed in [2].

L³ was based on the observation that most previous approaches, when performing
pruning to reduce the size of the rule base obtained from association rule
mining, may go too far and also discard useful knowledge. We extend this



N. Lavrac et al. (Eds.): PKDD 2003, LNAI 2838, pp. 35—46, 2003. 
© Springer- Verlag Berlin Heidelberg 2003 







Elena Baralis and Paolo Garza 



idea to consider multiple rules when classifying new data. In this
paper, we propose L³G, a new classification algorithm that combines the lazy
pruning approach of L³, which has been shown to yield accurate classification
results, with a rule assignment technique that selects the class label by basing
its decision on a group of eligible rules which are drawn either from the first
or the second level of the classifier.

The paper is organized as follows. Section 2 introduces the problem of
associative classification. In Section 3 we present the classification algorithm
L³G, by describing both the generation of the two levels of the classifier and the
classification of test data by means of majority voting applied to its two levels.
Section 4 provides experimental results which validate the L³G approach. Finally,
in Section 5 we discuss the main differences between our approach and previous
work on associative classification, and Section 6 draws conclusions.

2 Associative Classification 

The database is represented as a relation R, whose schema is given by k distinct
attributes A_1, ..., A_k and a class attribute C. Each tuple in R can be described as
a collection of pairs (attribute, integer value), plus a class label (a value belonging
to the domain of class attribute C). Each pair (attribute, integer value) will be
called an item in the remainder of the paper. A training case is a tuple in relation
R, where the class label is known, while a test case is a tuple in R where the
class label is unknown.

The attributes may have either a categorical or a continuous domain. For
categorical attributes, all values in the domain are mapped to consecutive positive
integers. In the case of continuous attributes, the value range is discretized into
intervals, and the intervals are also mapped into consecutive positive integers¹.
In this way, all attributes are treated uniformly.

A classifier is a function from A_1, ..., A_k to C that allows the assignment of
a class label to a test case. Given a collection of training cases, the classification
task is the generation of a classifier able to predict the class label for test cases
with high accuracy.

Association rules [1] are rules in the form X → Y. When using them for
classification purposes, X is a set of items, while Y is a class label. A case d is
said to match a collection of items X when X ⊆ d. The quality of an association
rule is measured by two parameters: its support, given by the number of cases
matching X ∪ Y over the number of cases in the database, and its confidence,
given by the number of cases matching X ∪ Y over the number of cases
matching X. Hence, the classification task can be reduced to the generation of
the most appropriate set of association rules for the classifier. Our approach to
this task is described in the next section.
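
The two quality measures defined above can be illustrated with a minimal Python
sketch. The representation is an assumption made for the example only: each
case is a pair (set of items, class label), each item is an (attribute, value)
pair, and the function name support_confidence is hypothetical.

```python
def support_confidence(body, label, cases):
    """Support and confidence of the rule body -> label.

    cases: list of (items, class_label) pairs, where items is a frozenset
    of (attribute, value) pairs. A case matches the body X when X is a
    subset of its items.
    """
    n_body = sum(1 for items, _ in cases if body <= items)
    n_both = sum(1 for items, c in cases if body <= items and c == label)
    support = n_both / len(cases)
    # Confidence is undefined when no case matches the body; 0.0 is our choice.
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


cases = [
    (frozenset({("A1", 1), ("A2", 2)}), "yes"),
    (frozenset({("A1", 1), ("A2", 3)}), "no"),
    (frozenset({("A1", 1), ("A2", 2)}), "yes"),
    (frozenset({("A1", 2), ("A2", 2)}), "no"),
]
support, confidence = support_confidence(frozenset({("A1", 1)}), "yes", cases)
print(support, confidence)
```

In this toy data set the body {(A1, 1)} matches three of the four cases, two of
which carry the label "yes", so the support is 2/4 and the confidence is 2/3.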



¹ The problem of discretization has been widely dealt with in the machine learning
community (see, e.g., [5]) and will not be discussed further in this paper.
3 Majority Classification

In this paper, we introduce the use of multiple association rules to perform
classification of structured data in the levelwise classifier L³ [2]. In L³, a lazy
pruning technique is proposed, which only discards “harmful” rules, i.e., rules
that only misclassify training cases. Lazy pruning is coupled with a two-level
classification approach. Rules that would be discarded by currently used pruning
techniques are included in the second level of the classifier and used only when
first level rules are not able to classify a test case.

Majority selection of the class label requires (a) selecting a group of good 
quality rules matching the case to be classified, and (b) assigning the appropriate 
class with simple majority voting among selected rules. To obtain a good quality 
rule set in step (a), a wide selection of rules from which to extract matching 
rules should be available. In Section 3.1 we describe how association rules are 
extracted, while in Section 3.2 we discuss how the rules that form the model
of the classifier are selected. Finally, in Section 3.3 the majority classification 
technique is presented. 



3.1 Association Rule Extraction 

Analogously to L³, in L³G an abundance of classification rules allows a wider choice
of rules both for rule selection when the classifier is generated, and for new case
classification. Hence, during the rule extraction phase, the support threshold
should be set to zero. Only the confidence threshold should be used to select
good quality rules. Unfortunately, no rule mining algorithm extracting rules
only with a confidence threshold is currently available².

In L³G the extraction of classification rules is performed by means of an
adaptation of the well-known FP-growth algorithm [7], which only extracts
association rules with a class label in the head. Analogously to [8], we also perform
pruning based on χ² (see below) during the rule extraction process.



3.2 Pruning Techniques and Classifier Generation 

In L³G two pruning techniques are applied: χ² pruning and lazy pruning. χ² is
a statistical test widely used to analyze the dependence between two variables.
The use of χ² as a quality index for association rules was first proposed
in [11], and it is also used in [8] for pruning purposes. This type of pruning was not
performed in L³. However, we performed a large number of experiments, which
have shown that rules which do not meet the χ² threshold are usually useless
for classification purposes. Since the χ² test heavily reduces the size of the
rule set, it may significantly increase the efficiency of the following steps without
deteriorating the informative content (quality) of the rule set after pruning. We
perform χ² pruning during the classification rule extraction step.
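
As an illustration of the test, the sketch below computes the standard Pearson
χ² statistic on the 2x2 contingency table that crosses "the case matches the
rule body" with "the case has the rule's class label". This is a generic
textbook formulation written in Python, not the authors' implementation; the
function names and the threshold argument are hypothetical, and the threshold
value actually used for pruning is not stated in this section.

```python
def chi_square(n_both, n_body, n_class, n_total):
    """Pearson chi-square for a rule body -> class.

    n_both:  cases matching the body and having the class label
    n_body:  cases matching the body
    n_class: cases having the class label
    n_total: all cases
    """
    observed = [
        [n_both, n_body - n_both],
        [n_class - n_both, n_total - n_body - n_class + n_both],
    ]
    row_totals = [n_body, n_total - n_body]
    col_totals = [n_class, n_total - n_class]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n_total
            if expected > 0:
                chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


def passes_chi_square(n_both, n_body, n_class, n_total, threshold):
    """Keep a rule only if its statistic reaches the (assumed) threshold."""
    return chi_square(n_both, n_body, n_class, n_total) >= threshold
```

A body that is independent of the class (for example 25 joint occurrences with
50 body matches and 50 class members out of 100 cases) yields a statistic of 0
and would be pruned under any positive threshold.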

² Some attempt in this direction has been proposed in [13], but its scalability is unclear.
Even with χ² pruning, if a low minimum support threshold is used, a huge
rule set may be generated during the extraction phase. However, most of these
rules may be useful [2]. The second pruning technique used in L³G is the lazy
pruning technique proposed in L³.

Before performing lazy pruning, a global order is imposed on the rule base. 
Rules are first sorted on descending confidence, next on descending support, 
then on descending length (number of items in the body of the rule), and finally 
lexicographically on items. The only significant difference with respect to most 
previous work ([8], [9]) is rule sorting on descending length. Most previous ap- 
proaches prefer short rules over long rules. The reason for our choice is to give a 
higher rank in the ordering to more specific rules (rules with a larger number of 
items in the body) over generic rules, which may lead to misclassification. Note 
that, since shorter rules are not pruned, they can be considered anyway. 
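
The global order can be expressed as a composite sort key. The Python sketch
below assumes, for illustration only, that a rule is a dictionary with the keys
body (a set of (attribute, value) items), confidence and support; the final
lexicographic tie-break compares the sorted item lists, which is one possible
reading of "lexicographically on items".

```python
def global_order_key(rule):
    """Sort key: confidence, support and length descending, then items."""
    return (
        -rule["confidence"],
        -rule["support"],
        -len(rule["body"]),      # longer (more specific) rules rank first
        sorted(rule["body"]),    # lexicographic tie-break on the items
    )


def sort_rules(rules):
    """Return the rules in the global order, without modifying the input."""
    return sorted(rules, key=global_order_key)


example = [
    {"name": "short", "body": {("A1", 1)}, "confidence": 0.9, "support": 0.2},
    {"name": "long", "body": {("A1", 1), ("A2", 2)}, "confidence": 0.9, "support": 0.2},
    {"name": "best", "body": {("A3", 1)}, "confidence": 1.0, "support": 0.1},
    {"name": "lowsup", "body": {("A1", 1), ("A2", 2), ("A3", 1)}, "confidence": 0.9, "support": 0.1},
]
print([r["name"] for r in sort_rules(example)])
```

With equal confidence and support, the two-item rule is placed before the
one-item rule, which is the point where this order departs from approaches that
prefer short rules.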

The idea behind lazy pruning [2] is to discard from the classifier only the
rules that do not correctly classify any training case, i.e., the rules that only
negatively contribute to the classification of training cases. To this end, after
rule sorting, we cover the training cases to detect “harmful” rules (see Figure 1),
using a database coverage technique. However, to allow a wider selection of rules
for majority classification, a different approach is taken in the generation of the
classifier levels. In L³G a training case is removed from the data set when
it is covered by δ rules, while in L³ each training case is removed as soon as it is
covered by one rule. Hence, by setting δ = 1 the lazy pruning performed by L³G
degenerates into that of L³.

Lines 1-26 of the pseudocode in Figure 1 show our approach. The first rule r
in the sort order is used to classify each case d still in data (lines 3-11). Each case
d covered by r is included in the set r.dataClassified, and the counter d.matched
is increased. When d is covered by δ rules (d.matched = δ), d is removed from
data (line 9). The appropriate counter of r is increased (lines 6-7), depending
on the correctness of the label.

After all cases in data have been considered, r is checked. If rule r only
classified training cases wrongly (lines 12-18), then r is discarded, and the counter
of each case classified by r is decreased by one. Cases included in r.dataClassified
that were removed before (line 9), because they were covered by δ rules, are
included again in data (line 15).

The loop (lines 2-20) is repeated for the next rule in the order, considering
the cases still in data. The loop ends when either the data set or the rule set is
empty. The remaining rules are divided into two groups (lines 21-26), which will
form the two levels of the classifier:

Level I which includes rules that have already correctly classified at least one
training case.

Level II which includes rules that have not been used during the training phase,
but may become useful later.

Rules in each level are ordered following the global order described above. 

Rules in level I provide a high level model of each class. Rules in level II, 
instead, allow us to increase the accuracy of the classifier by capturing “special” 
Procedure generateClassifier(rules, data, δ)
 1. r = first rule of rules;
 2. while (data not empty) and (r not NULL) {
 3.   for each d in data {
 4.     if r matches d {
 5.       r.dataClassified = r.dataClassified ∪ d;
 6.       if (d.class == r.class) r.right++;
 7.       else r.wrong++;
 8.       d.matched++;
 9.       if (d.matched == δ) delete d from data;
10.     }
11.   }
12.   if r.wrong > 0 and r.right == 0 {
13.     delete r from rules;
14.     for each d in r.dataClassified {
15.       if (d.matched == δ) data = data ∪ d;
16.       d.matched--;
17.     }
18.   }
19.   r = next rule from rules;
20. }
21. for each r in rules {
22.   if r.right > 0
23.     levelI = levelI ∪ r;
24.   else
25.     levelII = levelII ∪ r;
26. }



Fig. 1. L³G classifier generation



cases which are not covered by rules in the first level. Even if the levels are used
similarly to L³, their size may be significantly different. In particular, the use of
δ generally increases the size of the first level, compared to the first level of L³.
Hence, L³G is characterized by a first level which is “fatter” than that of L³. In
Section 4 it is shown that this technique may provide a higher accuracy than L³,
but the model includes more rules and is hence somewhat less readable as a high
level description of the classifier. However, we note that the readability of the
classifier generated by L³G is still better than that of non-associative classifiers
(e.g., Naive-Bayes [6]).
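
The procedure of Figure 1 can be transcribed into runnable form. The Python
sketch below follows the pseudocode step by step under a few assumptions of our
own: the class names Case and Rule, the use of frozensets of items for rule
bodies and cases, and the list-based bookkeeping are illustrative choices. The
rules are expected to be already sorted in the global order.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    items: frozenset          # (attribute, value) items of the training case
    label: str                # known class label
    matched: int = 0          # number of retained rules covering the case


@dataclass
class Rule:
    body: frozenset           # items in the rule body
    label: str                # class label in the rule head
    right: int = 0
    wrong: int = 0
    data_classified: list = field(default_factory=list)


def generate_classifier(rules, data, delta):
    """Split `rules` into (level I, level II) by lazy pruning with coverage delta."""
    active = list(data)       # training cases still in `data`
    kept = []                 # examined rules that were not discarded
    i = 0
    while active and i < len(rules):
        r = rules[i]
        i += 1
        still_active = []
        for d in active:
            removed = False
            if r.body <= d.items:                  # r matches d
                r.data_classified.append(d)
                if d.label == r.label:
                    r.right += 1
                else:
                    r.wrong += 1
                d.matched += 1
                removed = d.matched == delta       # covered by delta rules
            if not removed:
                still_active.append(d)
        active = still_active
        if r.wrong > 0 and r.right == 0:           # harmful rule: discard it
            for d in r.data_classified:
                if d.matched == delta:             # d was removed by r: restore
                    active.append(d)
                d.matched -= 1
        else:
            kept.append(r)
    kept.extend(rules[i:])                         # rules never examined
    level1 = [r for r in kept if r.right > 0]
    level2 = [r for r in kept if r.right == 0]
    return level1, level2
```

Calling generate_classifier with delta = 1 reproduces the single-coverage
behaviour described for L³, while larger values keep each training case
available to more rules and therefore tend to enlarge level I.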



3.3 Classification 

Majority classification is performed by considering multiple classification rules
to assign the class label to a test case. The first step is the selection of a group of
rules matching the given test case. When rules in the group yield different class
labels, a simple majority voting technique is used to assign the class label. The
size of the rule group (i.e., the number of rules used to classify a new case) may
vary, depending on the number of rules matching the new case. It is
limited by an upper bound, defined by the parameter max_rules.

This technique is combined with the two levels of the classifier. To build 
a rule group, rules in level I of the classifier are first considered. If no rule in this 
level matches the test case, then rules in level II are considered. Hence, rules in 
a rule group are never selected from both levels. 

When a new case is to be classified, the first level is considered. The algorithm
selects (at most) the first max_rules rules in the first level matching the case.
When at least one rule matches the case, the matching process stops either at
the end of the first level, or when the upper limit max_rules is reached.

Selected rules are divided into sets, one for each class label. Then, simple
majority voting takes place. The rule set with the largest cardinality assigns
the class label to the new case. A different approach, based on the evaluation of
a weight for each rule set using χ² (denoted as χ²-max), is proposed in CMAR
[8]. We performed a wide set of experiments which showed that the average
accuracy obtained by using this method is slightly lower than that given by the
simple majority technique described above, differently from what is reported in
[8]. The difference between our results and those reported in [8] may be due to
the elimination of redundant rules, which is applied in CMAR and not in our approach.

If no rule in the first level matches the test case, then rules in level II are con- 
sidered. Both matching process and label assignment are repeated analogously 
for this level. 
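
The classification step described in this section can be sketched as follows in
Python. The sketch assumes that each level is a list of (body, label) pairs
already in the global order, with bodies and test cases given as frozensets of
items. Two details are our own assumptions, since the text does not specify
them: ties in the vote go to the class whose rule was encountered first, and
None is returned when no rule in either level matches.

```python
from collections import Counter


def select_group(level, case_items, max_rules):
    """Labels of the first (at most) max_rules rules in `level` matching the case."""
    group = []
    for body, label in level:
        if body <= case_items:
            group.append(label)
            if len(group) == max_rules:
                break
    return group


def classify(level1, level2, case_items, max_rules):
    """Majority vote within level I; level II is used only if level I has no match."""
    for level in (level1, level2):
        group = select_group(level, case_items, max_rules)
        if group:
            # most_common(1) returns the first-encountered label among ties.
            return Counter(group).most_common(1)[0][0]
    return None


level1 = [
    (frozenset("a"), "X"),
    (frozenset("b"), "Y"),
    (frozenset("ab"), "Y"),
]
level2 = [(frozenset("c"), "Z")]
print(classify(level1, level2, frozenset("ab"), max_rules=3))
```

With max_rules = 1 the same call falls back to single rule classification and
returns the label of the first matching rule, which illustrates why a group of
several rules is needed for the vote to have any effect.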

We observe that the use of δ > 1 during the lazy pruning phase is necessary
when using the simple majority technique described before to classify new cases.
Indeed, if δ is set to one, only few rules are included in the first level, which
becomes very thin. In this case, just a couple of rules may be available in the
first level for matching and majority voting, and the selection of the appropriate
class label may degenerate to the case of single rule classification.

L³G usually contains a large number of rules. In particular, the first level of
the classifier contains a limited number of rules, which during the training phase
covered some training cases. The cardinality of the first level is comparable to
the size of the rule set in CMAR, while most previous approaches, including
the first level of L³, were characterized by a smaller rule set. As shown in
Section 4, this level performs the “heavy duty” classification of most test cases
and provides a general model of each class. By contrast, level II of the classifier
usually contains a large number of rules which are seldom used. These rules allow
the classification of some more cases, which cannot be covered by rules in the
first level.

Since level I usually contains about 10^-10^ rules, it can easily fit in main 
memory. Thus, the main classification task can be performed efficiently. Level II, 
in our experiments, included around 10^-10® rules. Rules were organized in a 
compact list, sorted as described in Section 3.2. Level II of L³G could generally be
loaded in main memory as well. Of course, if the number of rules in the second 
level further increases (e.g., because the support threshold is further lowered to 
capture more rules with high confidence), efficient access may become difficult. 
4 Experimental Results

In this section we describe the experiments to measure accuracy and
classification efficiency for L³G. We compared L³G with the classification
algorithms CBA [9], CMAR [8], C4.5 [12], and with its previous version with single
rule classification, L³ [2]. The differences between our approach and the above
algorithms are further discussed in Section 5. A large set of experiments has been
performed, using 26 data sets downloaded from the UCI Machine Learning
Repository [3]. The experiments show that L³G achieves a larger average accuracy
(+0.47% over the best previous, i.e., L³), and has the best accuracy on 10 of the
26 data sets.

For classification rule extraction the minimum support threshold has been set
to 1%, a standard value used by previous associative classifiers. For 5 data sets
(auto, hypo, iono, sick, sonar) the minimum support threshold has been set to 5%,
to limit the number of generated rules. The confidence constraint has not been
enforced, i.e., minconf = 0. We have adopted the same technique used by CBA to
discretize continuous attributes. A 10-fold cross validation test has been used to
compute the accuracy of the classifier. All the experiments have been performed
on a 1000 MHz Pentium III PC with 1.5 GB main memory, running RedHat Linux
7.2.

Recall from Section 3 that the performance of L³G depends on the values of
two parameters: δ and max_rules. δ is used during the training phase and sets the
maximal number of rules that can match a training case; max_rules is used during
the classification phase and sets an upper bound on the size of the selected rule
group before voting. A large number of experiments has been performed, using
different values for the parameters δ and max_rules. Unfortunately, it has not
been possible to find overall optimal values for the parameters. However, we have
devised values, denoted as default values, that yield a good average result, and
are sufficiently appropriate for every data set considered in the experiments. The
default values are δ = 9, max_rules = 9. These values may be used for “normal”
classification usage, for any data set.

We report in Figure 2 the variation of accuracy with varying δ for different
values of max_rules. For many data distributions, of which data set TicTac is
representative, accuracy tends to be stable after a given threshold for parameter
values. Hence, default values and optimal values tend to be very close. A
different behavior is shown by data set Cleve, for which high accuracy is
associated with very specific values of the parameters (e.g., optimal values are
δ = 2, max_rules = 9). For these exceptional cases, optimal values can only
be computed by running a vast number of experiments in which values of the
parameters are varied and average accuracy is evaluated with ten-fold cross
validation. The value pair that yields the best accuracy is finally selected. This
technique requires fine tuning for a specific data set and should be used only
when very high accuracy is needed.
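
The exhaustive search described above amounts to a grid search over the two
parameters. The Python sketch below shows only the search loop; the function
name tune_parameters and the callable evaluate are hypothetical, and evaluate
is assumed to return the average cross-validated accuracy of a classifier
trained with the given δ and max_rules. The toy scoring function stands in for
that evaluation so that the example runs on its own.

```python
from itertools import product


def tune_parameters(evaluate, deltas, max_rules_values):
    """Return (best accuracy, delta, max_rules) over the full parameter grid."""
    best = None
    for delta, max_rules in product(deltas, max_rules_values):
        accuracy = evaluate(delta, max_rules)
        # Strict comparison: the first pair reaching the best accuracy is kept.
        if best is None or accuracy > best[0]:
            best = (accuracy, delta, max_rules)
    return best


def toy_evaluate(delta, max_rules):
    """Stand-in for a cross-validated accuracy, peaking at delta=2, max_rules=9."""
    return -((delta - 2) ** 2) - ((max_rules - 9) ** 2)


print(tune_parameters(toy_evaluate, range(1, 11), range(1, 11)))
```

Since every grid point requires a full cross validation run, the cost grows
with the product of the two value ranges, which is why the default values are
preferable unless very high accuracy is required.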

Table 1 compares the accuracy of L³G with the accuracy of L³, C4.5, CBA and
CMAR, obtained using standard values for all the parameters. In particular, the
columns of Table 1 are: (1) name of data set, (2) number of attributes, (3) number
of classes, (4) number of cases (records), (5) accuracy of C4.5, (6) accuracy of
Fig. 2. Variation of accuracy with varying δ



CBA, (7) accuracy of CMAR, (8) accuracy of L³, (9) accuracy of L³G with
default values (δ = 9, max_rules = 9, identical for all data sets), (10) accuracy
obtained using only the first level of L³G (always with default parameters), (11)
improvement in accuracy given by the second level of L³G, and (12) accuracy of
L³G with optimal values (different values of parameters for each data set).

L³G, with default values, has the best average accuracy (+0.47% with respect to
L³) and the best accuracy on 10 of the 26 UCI data sets. Only for 7 data sets is the
accuracy achieved by L³G lower than that achieved by L³, while for 15 data
sets the accuracy is larger. Hence, the use of majority voting can improve the
approach proposed by L³.

We ran experiments to separate the contribution in accuracy improvement
due to the use of multiple rules during the classification phase, and to the second
level. In particular, we compared the accuracy obtained by only using rules in
level I of L³G with the accuracy obtained by using both levels. The results of
the experiments are reported in Table 1. The related columns of Table 1 are:
(10) accuracy of L³G using only rules in the first level, (11) difference between
L³G with both levels (column (9)) and L³G with only first level (column (10)).
By considering only rules in the first level, L³G achieves best accuracy on 9 of the
UCI data sets, and has average accuracy higher than L³ (+0.25%). This result
shows the significant effect due to multiple rule usage in the first level.

The effect of the second level in L³G is definitely less relevant than in L³. We
observe an increase in accuracy given by the second level only in 8 data sets, and
the average increase in accuracy is +0.22%. The increase given by the second
level of L³ is more relevant [2]. In particular, for 20 data sets the second level
is useful, and an average accuracy increase of +1.67% is given by the use of the
second level. These results highlight that the second level is very useful when δ
is set to 1 (L³) and the first level is very thin. However, its contribution is less
Table 1. Comparison of L³G accuracy with respect to previous algorithms

Name        A   C     R     C4.5      CBA     CMAR       L³      L³G      L³G     Δacc      L³G
                                                             default  I level           optimal
                                                              values     only            values
Anneal     38   6   898     94.8     97.9     97.3     96.2     96.4     96.4     0.00     96.4
Austral    14   2   690     84.7     84.9     86.1     85.7     86.1     86.1     0.00     86.4
Auto(*)    25   7   205     80.1     78.3     78.1     81.5     78.5     76.6     1.90     81.5
Breast     10   2   699     95.0     96.3     96.4     95.9     96.6     96.6     0.00     96.7
Cleve      13   2   303     78.2     82.8     82.2     82.5     82.5     82.5     0.00     86.4
Crx        15   2   690     84.9     84.7     84.9     84.4     85.5     85.1     0.40     85.9
Diabetes    8   2   768     74.2     76.7     74.5     75.8     78.6     78.6     0.00     79.0
German     20   2  1000     72.3     73.4     74.9     73.8     74.5     74.5     0.00     74.7
Glass       9   7   214     68.7     73.9     70.1     76.6     75.7     75.7     0.00     76.6
Heart      13   2   270     80.8     81.9     82.2     84.4     83.3     83.3     0.00     84.4
Hepatic    19   2   155     80.6     81.8     80.5     81.9     81.9     81.3     0.60     83.2
Horse      22   2   368     82.6     82.1     82.6     82.9     82.1     81.8     0.30     83.2
Hypo(*)    25   2  3163     99.2     98.9     98.4     95.2     97.5     97.5     0.00     97.5
Iono(*)    34   2   351     90.0     92.3     91.5     93.2     92.8     92.0     0.80     93.2
Iris        4   3   150     95.3     94.7     94.0     93.3     93.3     93.3     0.00     94.0
Labor      16   2    57     79.3     86.3     89.7     91.2     96.5     96.5     0.00     96.5
Led7        7  10  3200     73.5     71.9     72.5     72.0     72.4     72.4     0.00     72.8
Lymph      18   4   148     73.5     77.8     83.1     85.1     84.5     83.8     0.70     85.1
Pima        8   2   768     75.5     72.9     75.1     78.4     78.0     78.0     0.00     79.2
Sick(*)    29   2  2800     98.5     97.0     97.5     94.7     94.7     94.7     0.00     94.7
Sonar(*)   60   2   208     70.2     77.5     79.4     78.9     81.7     81.7     0.00     81.7
Tic-tac     9   2   958     99.4     99.6     99.2     98.4    100.0    100.0     0.00    100.0
Vehicle    18   4   846     72.6     68.7     68.8     73.1     73.2     73.0     0.20     73.2
Waveform   21   3  5000     78.1     80.0     83.2     82.1     82.8     82.8     0.00     82.8
Wine       13   3   178     92.7     95.0     95.0     98.3     98.9     98.3     0.60     98.9
Zoo        16   7   101     92.2     96.8     97.1     95.1     97.0     97.0     0.00     97.0
Average                    83.34    84.69    85.22    85.88    86.35    86.13     0.22    86.84

(*) Minimum support threshold 5%



relevant for i5 > 1, when the first level is already rich enough to allow a good 
coverage of test cases. The second level remains always useful to capture special 
cases and allows a further increase in accuracy. 

In column (12) of Table 1 are reported the accuracy results obtained by using 
optimal values of the S and maxjrules parameters for each data set. The effect 
of fine tuning the parameters values is significant, since it yields an increase of 
about +1% compared to and 0.49% with respect to L\j with default values. 
Furthermore, the classifier shows best accuracy on 17 data sets. 

Table 2 allows us to compare the structure and usage of the two levels for 
and L|^(with default values for the parameters). In order to analyze only 
the effect of multiple rule selection on the size of the two levels, Li^ has been 
modified to incorporate pruning. This allows us to observe the difference in 
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level size between and due exclusively to the level assignment technique 
based on multiple rule selection. 

In Table 2 the comparison of the number of rules in the first level of 
(column (3)) and of (column (4)) shows that the first level of is about 
an order of magnitude larger. We performed other experiments, not reported 
here, using different values for S, which showed that the first level of is 
approximately <5 times larger than the first level of L^. The only exception to 
this rule is data set Wine, where the size of the first level in and Li^ is 
comparable. In this case, the second level becomes more useful, since rules in 
the first level are not enough to cover all test cases. We can conclude that the 
first level of trades a reduced readability in favor of an increased accuracy 
and the value of 5 allows to fine tune the tradeoff between these two features. 

In Table 2 is also reported the number of rules in the second level for Li^ 
(column (5)) and L\[ (column (6)). We observe that the second level of L\j is 
usually slightly smaller than that of L^. This may be due to two different effects. 
(1) The first level of is larger and contains some rules that would have been 
assigned to the second level of . This effect is particularly evident in the case 
of data set Iris, where the total number of rules is rather small. In this case, 
most rules migrate from the second level to the first level, leaving an almost 
empty second level. (2) The multiple matching technique used for generating 
the classifier causes L\j to analyze more rules, which did not consider at all. 
If these rules make only mistakes, they are pruned by L^, but not by {L^ 
considers them unused and assigns them directly to the second level). 

We also analyzed the performance of $L^3_M$ during the classification of test data.
The classification time is not affected by the use of the second level, because it
is used rarely (see column (8) of Table 2). The average time for classifying a new
case is about 1 ms, and is comparable to that reported for $L^3$. With respect to
memory usage, since the size of both levels is not dramatically different for
$L^3_M$ and $L^3$, the same considerations already reported in [2] hold also for
$L^3_M$.



5 Previous Related Work 

CMAR [8] is the first associative classification algorithm where multiple rules
are used to classify new cases. CMAR proposes a suite of different pruning
techniques: pruning of specialized rules, use of the $\chi^2$ coefficient, and
database coverage. In $L^3_M$ pruning based on the $\chi^2$ coefficient is adopted,
but specialized rules are not pruned. Our database coverage technique is more
tolerant, since it allows more rules to cover the same training case. This effect
depends on the value of the $\delta$ parameter, discussed in Section 4. A similar
parameter is available in CMAR (denoted as $\delta$), but its suggested value allows
a lower number of rules during the selection step. Hence, in CMAR useful rules may
be pruned, thus reducing the overall accuracy of the classifier. This problem has
been denoted as
overpruning in [2]. Furthermore, we use simple majority voting to assign the final
class label to a test case, while in CMAR a more complex weighting technique
based on $\chi^2$ is proposed. Experiments show that our technique is both simpler
and more effective.

Table 2. Usage of the two levels

Name       R      Rules I level   Rules I level   Rules II level   Rules II level   Use of I level   Use of II level
                  $L^3$           $L^3_M$         $L^3$            $L^3_M$          $L^3_M$ (%)      $L^3_M$ (%)
(1)        (2)    (3)             (4)             (5)              (6)              (7)              (8)
Anneal     898    38              358             169802           168851           99.44            0.56
Austral    690    152             1458            171638           159165           100.00           0.00
Breast     699    51              516             6241             5407             100.00           0.00
Cleve      303    74              724             16481            14676            100.00           0.00
Crx        690    159             1422            341382           322675           99.57            0.43
Diabetes   768    65              360             466              180              100.00           0.00
German     1000   291             2420            62359            57786            100.00           0.00
Glass      214    30              274             1385             1047             100.00           0.00
Heart      270    56              506             3449             2725             100.00           0.00
Hepatic    155    31              313             185453           184757           98.71            1.29
Horse      368    97              888             179345           177803           99.73            0.27
Iris       150    8               82              88               13               100.00           0.00
Labor      57     13              85              209              119              100.00           0.00
Led7       3200   75              318             1159             980              100.00           0.00
Lymph      148    40              302             1442098          1441055          95.95            4.05
Pima       768    64              362             472              174              100.00           0.00
Tic-tac    958    28              599             3258             2566             100.00           0.00
Vehicle    846    180             1433            2408341          2406231          99.53            0.47
Wine       178    8               11              122249           122116           99.40            0.60
Zoo        101    10              72              1515389          1515288          100.00           0.00
Average                                                                             99.63            0.37

The $L^3_M$ algorithm derives its two level approach from the $L^3$ algorithm
proposed in [2] and enhances it with the introduction of classification based on
multiple rules. However, introducing majority voting requires a larger first level,
which may reduce the readability of the model with respect to $L^3$. Hence, the
selection of an appropriate value for the $\delta$ parameter allows the fine tuning
of the richness of the first level. We observe that $L^3$ can be seen as a degenerate
case of $L^3_M$ when both the $\delta$ and max_rules parameters are set to 1.

Associative classification was first proposed in CBA [9]. CBA, based
on the Apriori algorithm, extracts only a limited number of association rules 
(max 80000). Furthermore, it applies a database coverage pruning technique 
that significantly reduces the number of rules in the classifier, thus losing relevant 
knowledge. A new version of the algorithm has been presented [10], in which the 
use of multiple supports is proposed, together with a combination of C4.5 and 
Naive-Bayes classifiers. Unfortunately, none of these techniques addresses the 
overpruning problem described in [2]. 

ADT [14] is a different classification algorithm based on association rules,
combined with decision tree pruning techniques. All rules with a confidence
greater than or equal to a given threshold are extracted and more specific rules
are pruned. A decision tree is created based on the remaining association rules, on
which classical decision tree pruning techniques are applied. Analogously to other
algorithms, the classifier is composed of a small number of rules and is prone to
the overpruning problem.

6 Conclusions 

In this paper we have described $L^3_M$, an associative classifier which combines
levelwise classification with majority voting. This approach is a natural extension
of the concept of exploiting rule abundance for associative classification,
initially proposed in [2]. In [2] rule abundance was only pursued when selecting
rules to form the classifier by performing lazy pruning. With $L^3_M$ we extend the
same concept to the classification phase, by considering multiple rules for label
assignment. Experiments show that the adopted approach yields a good increase
in accuracy with respect to previous approaches. The main disadvantage of this
approach is the (slightly) reduced readability of the first level of the classifier,
which should provide a general model of classes.
Abstract. Pushing monotone constraints in frequent pattern mining
can help prune the search space, but at the same time it can also reduce
the effectiveness of anti-monotone pruning. There is a clear tradeoff.
Is it better to exploit more monotone pruning at the cost of less
anti-monotone pruning, or vice versa? The answer depends on characteristics
of the dataset and the selectivity of constraints. In this paper, we
characterize in depth this tradeoff and its related computational problem. As a
result of this characterization, we introduce an adaptive strategy, named 
ACP (Adaptive Constraint Pushing) which exploits any conjunction of 
monotone and anti-monotone constraints to prune the search space, and 
level by level adapts the pruning to the input dataset and constraints, 
in order to maximize efficiency. 



1 Introduction 

Constrained itemsets mining is a hot research theme in data mining [3, 4, 5, 6]. The
most studied constraint is the frequency constraint, whose anti-monotonicity is
used to reduce the exponential search space of the problem. Exploiting the
anti-monotonicity of the frequency constraint is also known as the apriori trick [1]:
this is a valuable heuristic that drastically reduces the search space, making the
computation feasible in many cases. Frequency is not only computationally
effective, it is also semantically important, since frequency provides "support" to
any discovered knowledge. For these reasons frequency is the base constraint
and in general we talk about frequent itemsets mining. However, many other
constraints can facilitate user focused exploration and control as well as reduce
the computation. For instance, a user could be interested in mining all frequently
purchased itemsets having a total price greater than a given threshold and
containing at least two products of a given brand. Classes of constraints sharing
nice properties have been identified. The class of anti-monotone constraints
is the most effective and easy to use in order to prune the search space. Since
any conjunction of anti-monotone constraints is an anti-monotone constraint,
we can use all the constraints in a conjunction to make the apriori trick more
selective.

The dual class, monotone constraints, has been considered more complicated 
to exploit and less effective in pruning the search space. As highlighted by Bouli- 
caut and Jeudy in [3], pushing monotone constraints can lead to a reduction of 
anti-monotone pruning. Therefore, when dealing with a conjunction of mono- 
tone and anti-monotone constraints we face a tradeoff between anti-monotone 
and monotone pruning. 

In [2] we have shown that the above consideration holds only if we focus
completely on the search space of all itemsets, which is the approach followed so
far. With the novel algorithm ExAnte we have shown that an effective way of
attacking the problem is to reason on both the itemsets search space and the
transactions input database together. In this way, pushing monotone constraints does
not reduce anti-monotone pruning opportunities; on the contrary, such
opportunities are boosted. Dually, pushing anti-monotone constraints boosts monotone
pruning opportunities: the two components strengthen each other recursively.
ExAnte is a pre-processing data reduction algorithm which dramatically reduces
both the search space and the input dataset. It can be coupled with any
constrained patterns mining algorithm, and it is always profitable to start any
constrained patterns computation with an ExAnte preprocess. However, after
the ExAnte preprocessing, when computing frequent patterns we face again the
tradeoff between anti-monotone and monotone pruning.

The Tradeoff 

Suppose that an itemset has been removed from the search space because it
does not satisfy a monotone constraint. This pruning avoids checking support
for this itemset; however, if we did check its support and found it smaller than
the threshold, we could prune away all the supersets of this itemset. In other
words, by monotone pruning we risk losing the anti-monotone pruning opportunities
given by the removed itemset. The tradeoff is clear [3]: pushing monotone constraints
can save tests on anti-monotone constraints, but the results of these tests
could have led to more effective pruning.

On one hand we can exploit all the anti-monotone pruning with an apriori
computation, checking the monotone constraint at the end, and thus not
performing any monotone constraint pushing. We call this strategy g&t (generate
and test). On the other hand, we can exploit completely any monotone pruning
opportunity, but the price to pay is less anti-monotone pruning. We call this
strategy mcp (monotone constraint pushing).

Neither of the two extremes outperforms the other on every input dataset
and conjunction of constraints. The best strategy depends on the characteristics
of the input, and the optimum is usually in between the two extremes.

In this paper, we introduce a general strategy, ACP, that balances the two
extremes adaptively. Both monotone and anti-monotone pruning are exploited
in a level-wise computation. Level by level, while acquiring new knowledge about
the dataset and selectivity of constraints, ACP adapts its behavior giving more
power to one pruning over the other in order to maximize efficiency.

Adaptive Constraint Pushing in Frequent Pattern Mining



Problem Definition 

Let $Items$ be a set of distinct literals, usually called items. An
itemset $X$ is a non-empty subset of $Items$. If $k = |X|$ then $X$ is called a
$k$-itemset. A transaction is a couple $\langle tID, X \rangle$ where $tID$ is the
transaction identifier and $X$ is the content of the transaction (an itemset). A
transaction database $TDB$ is a set of transactions. An itemset $X$ is contained in
a transaction $\langle tID, Y \rangle$ if $X \subseteq Y$. Given a transaction
database $TDB$, the subset of transactions which contain an itemset $X$ is named
$TDB[X]$. The support of an itemset $X$, written $supp_{TDB}(X)$, is the cardinality
of $TDB[X]$. Given a user-defined minimum support $\delta$, an itemset $X$ is called
frequent in $TDB$ if $supp_{TDB}(X) \geq \delta$. This is the definition of the
frequency constraint $C_{freq}[TDB]$: if $X$ is frequent we write $C_{freq}[TDB](X)$,
or simply $C_{freq}(X)$ when the dataset is clear from the context. Let
$Th(C) = \{X \mid C(X)\}$ denote the set of all itemsets $X$ that satisfy constraint
$C$. The frequent itemset mining problem requires to compute the set of all frequent
itemsets $Th(C_{freq})$. In general, given a conjunction of constraints $C$, the
constrained itemset mining problem requires to compute $Th(C)$; the constrained
frequent itemsets mining problem requires to compute $Th(C_{freq}) \cap Th(C)$.
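The definitions above can be illustrated with a minimal brute-force sketch. The five transactions below are hypothetical: they are not the data of Figure 1, whose table is not reproduced in this text, and the names `supp` and `theory` are illustrative only. Enumerating every itemset is feasible only for toy inputs.

```python
from itertools import combinations

# Hypothetical toy transaction database (an assumption, not the paper's data).
TDB = [frozenset(t) for t in ("bde", "bde", "cde", "cde", "abc")]
ITEMS = sorted(set().union(*TDB))
DELTA = 2  # user-defined minimum support


def supp(itemset, tdb=TDB):
    """Support of an itemset: the number of transactions that contain it."""
    x = frozenset(itemset)
    return sum(1 for t in tdb if x <= t)


def theory(constraint, items=ITEMS):
    """Th(C): all non-empty itemsets over `items` that satisfy `constraint`."""
    return {frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if constraint(frozenset(c))}


# Th(C_freq) for the frequency constraint supp(X) >= DELTA.
frequent = theory(lambda x: supp(x) >= DELTA)
print(sorted("".join(sorted(x)) for x in frequent))
```

A constrained query is then just the intersection of two such sets, for example `frequent & theory(some_other_constraint)`.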

Definition 1. Given an itemset $X$, a constraint $C_{AM}$ is anti-monotone if:

$$\forall Y \subseteq X : C_{AM}(X) \Rightarrow C_{AM}(Y)$$

Definition 2. Given an itemset $X$, a constraint $C_M$ is monotone if:

$$\forall Y \supseteq X : C_M(X) \Rightarrow C_M(Y)$$

independently from the given input transaction database.

Observe that the independence from the input transaction database is necessary
since we want to distinguish between simple monotone constraints and global
constraints such as the "infrequency constraint": $supp_{TDB}(X) < \delta$. This
constraint is still monotone but it is dataset dependent and it requires dataset
scans in order to be computed.
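As a sanity check of Definitions 1 and 2, the following sketch tests both properties by brute force on a small instance. The transactions and prices are hypothetical, and passing the check on one dataset only shows that the property holds on that instance, not in general. It is enough to compare each itemset with its immediate subsets and supersets, since the general statement follows by induction.

```python
from itertools import combinations

# Hypothetical toy data (assumptions, not taken from the paper's Figure 1).
TDB = [frozenset(t) for t in ("bde", "bde", "cde", "cde", "abc")]
PRICE = {"a": 4, "b": 4, "c": 4, "d": 4, "e": 12}
ITEMS = sorted(PRICE)
ALL = [frozenset(c) for k in range(1, len(ITEMS) + 1)
       for c in combinations(ITEMS, k)]


def supp(x):
    return sum(1 for t in TDB if x <= t)


def is_anti_monotone(c):
    """Whenever c(X) holds, c must hold after removing any single item."""
    return all(c(x - {i}) for x in ALL if c(x) and len(x) > 1 for i in x)


def is_monotone(c):
    """Whenever c(X) holds, c must hold after adding any single item."""
    return all(c(x | {i}) for x in ALL if c(x) for i in ITEMS if i not in x)


def frequency(x):
    return supp(x) >= 2


def min_price(x):
    return sum(PRICE[i] for i in x) >= 12


def infrequency(x):
    return supp(x) < 2  # monotone, but needs the dataset to be evaluated


print(is_anti_monotone(frequency), is_monotone(min_price), is_monotone(infrequency))
```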

Since any conjunction of monotone constraints is a monotone constraint, in
this paper we consider the problem:

$$Th(C_{freq}) \cap Th(C_M).$$

The concept of border is useful to characterize the solution space of the problem. 



Definition 3. Given an anti-monotone constraint $C_{AM}$ and a monotone
constraint $C_M$ we define their borders as:

$$B(C_{AM}) = \{X \mid \forall Y \subset X : C_{AM}(Y) \wedge \forall Z \supset X : \neg C_{AM}(Z)\}$$

$$B(C_M) = \{X \mid \forall Y \supset X : C_M(Y) \wedge \forall Z \subset X : \neg C_M(Z)\}$$

Moreover, we distinguish between positive and negative borders. Given a general
constraint $C$ we define:

$$B^+(C) = B(C) \cap Th(C) \qquad B^-(C) = B(C) \cap Th(\neg C)$$

Fig. 1. The borders $B(C_M)$ and $B(C_{freq})$ for the transaction database and the
price table on the left, with $C_M \equiv sum(X.prices) \geq 12$ and
$C_{AM} \equiv supp_{TDB}(X) \geq 2$.

In Figure 1 we show the borders of two constraints: the anti-monotone
constraint $supp(X) \geq 2$, and the monotone one $sum(X.prices) \geq 12$. In the
given situation the borders are:

$$B^+(C_M) = \{e, abc, abd, acd, bcd\} \qquad B^+(C_{freq}) = \{bde, cde\}$$

$$B^-(C_M) = \{ab, ac, ad, bc, bd, cd\} \qquad B^-(C_{freq}) = \{a, bc\}$$

The solutions to our problem are the itemsets that lie under the anti-monotone
border and over the monotone one: $R = \{e, be, ce, de, bde, cde\}$.
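The borders can be recomputed directly from Definition 3. The sketch below does so by brute force on a hypothetical dataset and price table; both are assumptions, chosen only so that the resulting borders and solution set coincide with the ones listed above, since the table of Figure 1 itself is not available in this text.

```python
from itertools import combinations

# Hypothetical data chosen to reproduce the borders listed in the text.
TDB = [frozenset(t) for t in ("bde", "bde", "cde", "cde", "abc")]
PRICE = {"a": 4, "b": 4, "c": 4, "d": 4, "e": 12}
UNIVERSE = [frozenset(c) for k in range(1, len(PRICE) + 1)
            for c in combinations(sorted(PRICE), k)]


def c_freq(x):
    return sum(1 for t in TDB if x <= t) >= 2


def c_m(x):
    return sum(PRICE[i] for i in x) >= 12


def border_am(c):
    """B(C_AM): every proper subset satisfies c, no proper superset does."""
    return {x for x in UNIVERSE
            if all(c(y) for y in UNIVERSE if y < x)
            and not any(c(z) for z in UNIVERSE if z > x)}


def border_m(c):
    """B(C_M): every proper superset satisfies c, no proper subset does."""
    return {x for x in UNIVERSE
            if all(c(y) for y in UNIVERSE if y > x)
            and not any(c(z) for z in UNIVERSE if z < x)}


def split(border, c):
    """Return the positive and negative parts (B+, B-) of a border."""
    positive = {x for x in border if c(x)}
    return positive, border - positive


def show(sets):
    return sorted(("".join(sorted(x)) for x in sets), key=lambda s: (len(s), s))


pos_m, neg_m = split(border_m(c_m), c_m)
pos_f, neg_f = split(border_am(c_freq), c_freq)
solutions = {x for x in UNIVERSE if c_freq(x) and c_m(x)}
print(show(pos_m), show(neg_m), show(pos_f), show(neg_f), show(solutions))
```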

Our Contributions: 

In the next section we provide a thorough characterization of the addressed
computational problem, and we compare the two opposite extreme strategies. Then
we introduce a general adaptive strategy, named ACP, which manages the two 
extremes in order to adapt its behavior to the given instance of the problem. 
The proposed strategy has the following interesting features: 

— It exploits both monotone and anti-monotone constraints in order to prune 
the search space. 

— It is able to adapt its behavior to the given input in order to maximize effi- 
ciency. This is the very first adaptive algorithm in literature on the problem. 

— It computes the support of every solution itemset, which is necessary when 
we want to compute association rules. 

— Being a level-wise solution it can be implemented exploiting the many op- 
timization techniques and smart data structure studied for the apriori algo- 
rithm. 
2 Level-Wise Solutions

Strategy g&t performs an apriori computation and then tests which of the frequent
itemsets also satisfy the monotone constraint. Strategy mcp, introduced by
Boulicaut and Jeudy [3], works the opposite way. The border $B^+(C_M)$ is
considered already computed and is given as input. Only itemsets over that
border, and hence in $Th(C_M)$, are generated by a special generation procedure
$generate_m$. Therefore we just need to check frequency for these candidates. The
procedure $generate_m$ takes as input the set of the solutions at the last iteration
$R_k$, the border $B^+(C_M)$, and the maximal cardinality of an element in
$B^+(C_M)$, which we denote $maxb$. In the rest of this paper we use $Items_k$ to
denote the set of all $k$-itemsets.



Procedure: $generate_m(k, R_k, B^+(C_M))$

1. if $k = 0$ then return $B^+(C_M) \cap Items$
2. else if $k < maxb$ then
3.     return $generate_1(R_k, Items) \cup (B^+(C_M) \cap Items_{k+1})$
4. else if $k \geq maxb$ then
5.     return $generate_{apriori}(R_k)$

Where:

- $generate_1(R_k, X) = \{A \cup B \mid A \in R_k \wedge B \in X\}$
- $generate_{apriori}(R_k) = \{X \mid X \in Items_{k+1} \wedge \forall Y \in Items_k : Y \subseteq X \,.\; Y \in R_k\}$

The procedure $generate_1$ creates as candidates only supersets of itemsets which
are solutions at the last iteration. Thus these candidates only need to be checked
against $C_{freq}$, since they surely satisfy $C_M$. These candidates are generated
by adding a 1-itemset to a solution. Unfortunately we cannot use the apriori trick
completely with this strategy. In fact a candidate itemset cannot be pruned away
simply because not all of its subsets are solutions, since some of them might not
have been considered at all. What we can do is prune whenever we know that at least
one subset of the candidate itemset is not a solution because it does not satisfy
$C_{freq}$. This pruning is performed on the set of candidates $C_k$ by the
procedure $prune_m$ given below.



Strategy: mcp

1. $C_1 := generate_m(0, \emptyset, B^+(C_M))$; $R_0 := \emptyset$; $k := 1$
2. while $C_k \neq \emptyset$ or $k < maxb$ do
3.     $C_k := prune_m(R_{k-1}, C_k)$
4.     test $C_{freq}(C_k)$; $R_k := Th(C_{freq}) \cap C_k$
5.     $C_{k+1} := generate_m(k, R_k, B^+(C_M))$; $k := k + 1$
6. end while
7. return $\bigcup_i R_i$

Procedure: $prune_m(R_{k-1}, C_k)$

1. $C' := C_k$
2. for all $S \in C_k$ do for all $S' \subset S : S' \in Items_{k-1}$
3.     do if $S' \notin R_{k-1} \wedge C_M(S')$ then remove $S$ from $C'$
4. return $C'$
Example 4. Consider the executions of strategy mcp and strategy g&t on the
dataset and the constraints in Figure 1, focusing on the number of checks of
$C_{freq}$. At the first iteration strategy mcp produces a unique candidate
$C_1 = \{e\}$, which is the only 1-itemset in $B^+(C_M)$. This candidate is checked
against the anti-monotone constraint and turns out to be a solution: $R_1 = \{e\}$.
At the second iteration 4 candidates are produced: $C_2 = \{ae, be, ce, de\}$. Only
$ae$ does not satisfy $C_{AM}$, hence $R_2 = \{be, ce, de\}$. At the third iteration
7 candidates are produced: $C_3 = \{abc, abd, acd, bcd, bce, bde, cde\}$. Only two of
these pass the anti-monotone check: $R_3 = \{bde, cde\}$. Finally $C_4 = \emptyset$.
Therefore, with the given dataset and constraints, strategy mcp performs
$1 + 4 + 7 = 12$ checks of $C_{freq}$. Strategy g&t uses a normal apriori computation
in order to find $Th(C_{freq})$ and then checks the satisfaction of $C_M$. It
performs 13 checks of $C_{freq}$.
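The two counts in Example 4 can be reproduced with a small sketch of both strategies. The dataset and prices are hypothetical stand-ins for Figure 1 (assumptions chosen to yield the same borders), and the function names simply mirror the pseudo-code of this section; this is an illustration, not the authors' implementation.

```python
from itertools import combinations

# Hypothetical stand-in for the data of Figure 1 (an assumption).
TDB = [frozenset(t) for t in ("bde", "bde", "cde", "cde", "abc")]
PRICE = {"a": 4, "b": 4, "c": 4, "d": 4, "e": 12}
ITEMS = sorted(PRICE)
DELTA, THRESHOLD = 2, 12


def supp(x):
    return sum(1 for t in TDB if x <= t)


def c_m(x):
    return sum(PRICE[i] for i in x) >= THRESHOLD


def generate_apriori(r_k, k):
    """(k+1)-itemsets all of whose k-subsets belong to r_k."""
    return {frozenset(c) for c in combinations(ITEMS, k + 1)
            if all(frozenset(s) in r_k for s in combinations(c, k))}


def gen_and_test():
    """Strategy g&t: plain apriori, monotone constraint checked afterwards."""
    tests, k, solutions = 0, 1, set()
    cands = {frozenset(i) for i in ITEMS}
    while cands:
        tests += len(cands)
        freq = {x for x in cands if supp(x) >= DELTA}
        solutions |= {x for x in freq if c_m(x)}
        cands = generate_apriori(freq, k)
        k += 1
    return solutions, tests


def border_plus_m():
    """B+(C_M): minimal itemsets satisfying C_M (brute force)."""
    sat = [frozenset(c) for k in range(1, len(ITEMS) + 1)
           for c in combinations(ITEMS, k) if c_m(frozenset(c))]
    return {x for x in sat if not any(y < x for y in sat)}


def generate_m(k, r_k, border, maxb):
    if k == 0:
        return {x for x in border if len(x) == 1}
    if k < maxb:
        grown = {a | {i} for a in r_k for i in ITEMS if i not in a}
        return grown | {x for x in border if len(x) == k + 1}
    return generate_apriori(r_k, k)


def prune_m(r_prev, c_k, k):
    """Drop S if a (k-1)-subset satisfies C_M but is not a solution."""
    return {s for s in c_k
            if not any(frozenset(sub) not in r_prev and c_m(frozenset(sub))
                       for sub in combinations(sorted(s), k - 1))}


def mcp():
    """Strategy mcp: start from the monotone border and move upwards."""
    border = border_plus_m()
    maxb = max(len(x) for x in border)
    tests, k, r_prev, solutions = 0, 1, set(), set()
    c_k = generate_m(0, set(), border, maxb)
    while c_k or k < maxb:
        c_k = prune_m(r_prev, c_k, k)
        tests += len(c_k)
        r_prev = {x for x in c_k if supp(x) >= DELTA}
        solutions |= r_prev
        c_k = generate_m(k, r_prev, border, maxb)
        k += 1
    return solutions, tests


print(gen_and_test()[1], mcp()[1])
```

Both functions return the same six solutions on this input; only the number of frequency tests differs.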

Example 5. This example is borrowed from [3]. Suppose we want to compute the
frequent itemsets $\{X \mid supp(X) \geq 100 \wedge |X| \geq 10\}$. This is a
conjunction of an anti-monotone constraint (frequency) with a monotone one
(cardinality of the itemset $\geq 10$). Strategy mcp generates no candidate of size
lower than 10. Every itemset of size 10 is generated as a candidate and tested for
frequency in one database scan. This leads to at least $\binom{n}{10}$ candidates,
where $n = |Items|$, and as soon as $n$ is large this becomes intractable. On the
other hand strategy g&t generates candidates that will never be solutions, but this
strategy remains tractable even for large $n$.
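To give a feeling for how quickly the number of size-10 candidates grows, the following short sketch evaluates the binomial coefficient for a few values of $n$. The chosen values of $n$ are arbitrary illustrations, not figures from the paper.

```python
import math

# Number of 10-itemsets over n items, i.e. the candidates mcp would
# generate at its first level in Example 5. The n values are arbitrary.
for n in (20, 50, 100, 1000):
    print(n, math.comb(n, 10))
```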

The two examples show that neither of the two strategies outperforms the
other on every input dataset and conjunction of constraints.

2.1 Strategies Analysis 

We formally analyze the search space explored by the two extreme level-wise 
strategies. To this purpose we focus on the number of frequency tests, since the 
monotone constraint is cheaper to test. 

Definition 6. Given a strategy $S$, the number of frequency tests performed by
$S$ is indicated with $|C_{freq}|_S$.

Generally, a strategy $S$ checks for frequency a portion of $Th(C_M)$ (which can
produce solutions) and a portion of $Th(\neg C_M)$ (which cannot produce solutions):

$$|C_{freq}|_S = \gamma |Th(\neg C_M)| + \beta |Th(C_M)| \qquad \gamma, \beta \in [0, 1] \tag{1}$$

The mcp strategy has $\gamma = 0$, but evidently it has a $\beta$ much larger than
strategy g&t, since it cannot benefit from the pruning of infrequent itemsets in
$Th(\neg C_M)$, as we formalize later. Let us further characterize the portion of
$Th(\neg C_M)$ explored by strategy g&t as:

$$\gamma |Th(\neg C_M)| = \gamma_1 |Th(\neg C_M) \cap Th(C_{freq})| + \gamma_2 |Th(\neg C_M) \cap B^-(C_{freq})| \tag{2}$$

For strategy g&t, $\gamma_1 = \gamma_2 = 1$: it explores all frequent itemsets
($Th(C_{freq})$) and the candidate frequent itemsets that turn out to be infrequent
($B^-(C_{freq})$), even if they are in $Th(\neg C_M)$ and thus cannot produce
solutions. Let us examine what happens over the monotone border. We can further
characterize the explored portion of $Th(C_M)$ as:

$$\beta |Th(C_M)| = \beta_1 |B^+(C_M) \cap (Th(\neg C_{freq}) \setminus B^-(C_{freq}))| + \beta_2 |R| + \beta_3 |B^-(C_{freq}) \cap Th(C_M)| \tag{3}$$

Trivially $\beta_2 = 1$ for any strategy which computes all the solutions of the
problem. Moreover, $\beta_3$ is also always 1, since we cannot prune these border
itemsets in any way. The only interesting variable is $\beta_1$, which depends on
$\gamma_2$. Since the only infrequent itemsets checked by strategy g&t are itemsets
in $B^-(C_{freq})$, it follows that for this strategy $\beta_1 = 0$. On the other
hand, strategy mcp generates as candidates all itemsets in $B^+(C_M)$ (see line 3 of
the $generate_m$ procedure), thus for this strategy $\beta_1 = 1$. The following
proposition summarizes the number of frequency tests computed by the two strategies.

Proposition 7.

$$|C_{freq}|_{g\&t} = |Th(\neg C_M) \cap Th(C_{freq})| + |Th(\neg C_M) \cap B^-(C_{freq})| + |R| + |B^-(C_{freq}) \cap Th(C_M)|$$

$$|C_{freq}|_{mcp} = |B^+(C_M) \cap (Th(\neg C_{freq}) \setminus B^-(C_{freq}))| + |R| + |B^-(C_{freq}) \cap Th(C_M)|$$
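As a worked instance of the first identity of Proposition 7, the sketch below evaluates its four terms by brute force. The dataset and price table are hypothetical (assumptions chosen to match the borders given for Figure 1), and the variable names are illustrative only.

```python
from itertools import combinations

# Hypothetical stand-in for the data of Figure 1 (an assumption).
TDB = [frozenset(t) for t in ("bde", "bde", "cde", "cde", "abc")]
PRICE = {"a": 4, "b": 4, "c": 4, "d": 4, "e": 12}
UNIVERSE = [frozenset(c) for k in range(1, len(PRICE) + 1)
            for c in combinations(sorted(PRICE), k)]


def c_freq(x):
    return sum(1 for t in TDB if x <= t) >= 2


def c_m(x):
    return sum(PRICE[i] for i in x) >= 12


th_freq = {x for x in UNIVERSE if c_freq(x)}
th_m = {x for x in UNIVERSE if c_m(x)}
th_not_m = set(UNIVERSE) - th_m

# B-(C_freq): infrequent itemsets whose proper subsets are all frequent.
neg_border_freq = {x for x in UNIVERSE
                   if not c_freq(x) and all(c_freq(y) for y in UNIVERSE if y < x)}

solutions = th_freq & th_m

# The four terms of |C_freq| for strategy g&t, in the order of the formula.
terms = [len(th_not_m & th_freq),
         len(th_not_m & neg_border_freq),
         len(solutions),
         len(neg_border_freq & th_m)]
print(terms, sum(terms))
```

On this input the terms sum to the 13 frequency checks reported for strategy g&t in Example 4.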

In the next section we introduce an adaptive algorithm which manages this 
tradeoff, balancing anti-monotone and monotone pruning w.r.t. the given input. 



3 Adaptive Constraint Pushing

The main drawback of strategy g&t is that it explores portions of the search space
in $Th(\neg C_M)$ which will never produce solutions ($\gamma_1 = 1$), while the main
drawback of strategy mcp is that it can generate candidate itemsets in
$Th(\neg C_{freq}) \setminus B^-(C_{freq})$ that would have already been pruned by a
simple apriori computation ($\beta_1 = 1$). This is due to the fact that strategy
mcp starts the computation bottom-up from the monotone border and has no knowledge
about the portion of search space below such a border ($Th(\neg C_M)$). However,
some knowledge about small itemsets which do not satisfy $C_M$ could be useful to
have a smaller number of candidates over the border. On the other hand, we need some
additional computation below the monotone border in order to acquire such knowledge.
Once again we face the tradeoff. The basic idea of a general adaptive pushing
strategy (ACP) is to explore only a portion of $Th(\neg C_M)$: this computation will
never create solutions, but if well chosen it can heavily prune the computation in
$Th(C_M)$. In other terms, it tries to balance $\gamma$ and $\beta_1$. To better
understand this, we must further characterize the search space $Th(\neg C_M)$:

$$\gamma |Th(\neg C_M)| = \gamma_1 |Th(\neg C_M) \cap Th(C_{freq})| + \gamma_2 |Th(\neg C_M) \cap B^-(C_{freq})| + \gamma_3 |Th(\neg C_M) \cap (Th(\neg C_{freq}) \setminus B^-(C_{freq}))| \tag{4}$$

Strategy ACP tries to reduce $\gamma_1$, but the price to pay is a possible
reduction of $\gamma_2$. Since the portion of search space
$Th(\neg C_M) \cap B^-(C_{freq})$ is helpful to prune, a reduction of $\gamma_2$
yields a reduction of pruning opportunities. This can lead to the exploration of a
portion of search space ($\gamma_3 > 0$) that would not have been explored by a g&t
strategy. This phenomenon can be seen as a virtual raising of the frequency border:
suppose we lose an itemset of the frequency border; we will later explore some of
its supersets and obviously find them infrequent. From the point of view of the
strategy these are frequency border itemsets, even if they are not really in
$B^-(C_{freq})$. The optimal ideal strategy would have $\gamma_1 = \gamma_3 = 0$
and $\gamma_2 = 1$, since we are under the monotone border and we are just looking
for infrequent itemsets in order to have pruning in $Th(C_M)$. Therefore, a general
ACP strategy should explore a portion of $Th(\neg C_M)$ in order to find infrequent
itemsets, trying not to lose pieces of $Th(\neg C_M) \cap B^-(C_{freq})$.

Proposition 8.

$$|C_{freq}|_{ideal} = |Th(\neg C_M) \cap B^-(C_{freq})| + |R| + |B^-(C_{freq}) \cap Th(C_M)|$$

$$|C_{freq}|_{ACP} = \gamma |Th(\neg C_M)| + \beta |Th(C_M)| \quad \text{(as defined in equations (3) and (4))}$$
Two questions arise:

1. What is a "good" portion of candidates?

2. How large should this set of candidates be?

The answer to the first question is simply: "itemsets which have higher
probabilities to be found infrequent". The answer to the second question is what the
adaptivity of ACP is about. We define a parameter $\alpha \in [0, 1]$ which
represents the fraction of candidates to be chosen among all possible candidates.
This parameter is initialized after the first scan of the dataset using all
information available, and it is updated level by level with the newly collected
knowledge.

Let us now introduce some notation useful for the description of the algo- 
rithm. 



- $L_k \subseteq \{I \mid I \in (Items_k \cap Th(C_{freq}) \cap Th(\neg C_M))\}$
- $N_k \subseteq \{I \mid I \in (Items_k \cap Th(\neg C_{freq}) \cap Th(\neg C_M))\}$
- $R_k = \{I \mid I \in (Items_k \cap Th(C_{freq}) \cap Th(C_M))\}$
- $P_k = \{I \mid I \in Items_k \wedge \forall n, m < k . (\nexists L \subset I . L \in B_m) \wedge (\nexists J \subset I . J \in N_n)\}$
- $B_k = \{I \mid I \in (B^+(C_M) \cap P_k)\}$
- $E_k = \{I \mid I \in (Th(\neg C_M) \cap P_k)\}$

$L_k$ is the set of frequent $k$-itemsets which are under the monotone border:
these have been checked for frequency even if they do not satisfy the monotone
constraint, hoping to find them infrequent; $N_k$ is the set of infrequent
$k$-itemsets under the monotone border: these are itemsets used to prune over the
monotone border; $R_k$ is the set of solution $k$-itemsets; $P_k$ is the set of
itemsets potentially frequent (none of their subsets have been found infrequent) and
potentially in $B^+(C_M)$ (none of their subsets have been found satisfying $C_M$).
$B_k$ is the subset of elements in $P_k$ which satisfy $C_M$ and hence are in
$B^+(C_M)$, since all their subsets are in $Th(\neg C_M)$. $E_k$ is the subset of
elements in $P_k$ which still do not satisfy $C_M$. From this set an
$\alpha$-portion of elements is chosen to be checked against the frequency
constraint, named $C_k^U$ (candidates Under). This selection is indicated as
$\alpha \otimes E_k$. Finally we have the set of candidates in which we can find
solutions, $C_k^O$ (candidates Over), which is the set of candidates over the
monotone border. The frequency test for these two candidate sets is performed with a
unique database scan and data structure. Itemsets in $C_k^O$ which satisfy
$C_{freq}$ will be solutions; itemsets in $C_k^U$ which do not satisfy $C_{freq}$ go
in $N_k$ and will prune itemsets in $C_j^O$ for some $j > k$.
We now introduce the pseudo-code for the generic adaptive strategy. In the
following, by the sub-routine $generate_{over}$ we mean $generate_1$ followed by
the pruning of itemsets which are supersets of itemsets in $N$ (we call it
$prune_{am}$), followed by $prune_m$ (described in Section 2).



Strategy: generic ACP

1. $R_0, N := \emptyset$; $C_1 := Items$
2. test $C_{freq}(C_1)$ $\Rightarrow$ $C_1 := Th(C_{freq}) \cap C_1$
3. test $C_M(C_1)$ $\Rightarrow$ $R_1 := Th(C_M) \cap C_1$; $L_1 := Th(\neg C_M) \cap C_1$
4. $P_2 := generate_{apriori}(L_1)$
5. $k := 2$
6. while $P_k \neq \emptyset$ do
7.     test $C_M(P_k)$ $\Rightarrow$ $B_k := Th(C_M) \cap P_k$; $E_k := Th(\neg C_M) \cap P_k$
8.     $C_k^O := generate_{over}(R_{k-1}, C_1, N) \cup B_k$
9.     initialize/update($\alpha$)
10.    $C_k^U := \alpha \otimes E_k$
11.    test $C_{freq}(C_k^U \cup C_k^O)$ $\Rightarrow$
       $R_k := Th(C_{freq}) \cap C_k^O$; $L_k := Th(C_{freq}) \cap C_k^U$; $N_k := Th(\neg C_{freq}) \cap C_k^U$
12.    $N := N \cup N_k$
13.    $P_{k+1} := generate_{apriori}(E_k \setminus N_k)$
14.    $k := k + 1$
15. end while
16. $C_k := generate_{over}(R_{k-1}, C_1, N)$
17. while $C_k \neq \emptyset$ do
18.    test $C_{freq}(C_k)$; $R_k := Th(C_{freq}) \cap C_k$
19.    $C_{k+1} := generate_{apriori}(R_k)$
20.    $k := k + 1$
21. end while
22. return $\bigcup_i R_i$



It is worth highlighting that the pseudo-code given in Section 2 for
strategy mcp, which is a theoretical strategy, does not perform the complete
first anti-monotone test, while strategies ACP and g&t perform it. This turns out
to be a reasonable choice on our toy example, but it is a ruinous choice
on every reasonably large dataset. However, we can imagine that any practical
implementation of mcp would perform at least this first anti-monotone test. We
call this practical implementation strategy mcp*. Moreover, strategy mcp* does
not take the monotone border as input but discovers it level-wise as ACP does.
In our experiments we will use mcp* instead of mcp.

Note that:

- if $\alpha = 0$ constantly, then ACP $\equiv$ strategy mcp*;

- if $\alpha = 1$ constantly, then ACP $\equiv$ strategy g&t.

Our adaptivity parameter can be seen as a setting knob which ranges from
0 to 1, from one extreme to the other. To better understand how ACP works, we
show the execution given the input in Figure 1.
3.1 Run-through Example

Strategy ACP starts with $C_1 = \{a, b, c, d, e\}$, tests the frequency constraint,
tests the monotone constraint and finds the first solution, $R_1 = \{e\}$, and
$L_1 = \{b, c, d\}$. Now (line 4) we generate the set of 2-itemsets potentially
frequent and potentially in $B^+(C_M)$: $P_2 = \{bc, bd, cd\}$. At this point we
enter the loop from line 6 to 15. The set $P_2$ is checked for $C_M$, and it turns
out that no element in $P_2$ satisfies the monotone constraint: thus
$B_2 = \emptyset$, $E_2 = P_2$. At line 8 ACP generates candidates for the
computation over the monotone border, $C_2^O = \{be, ce, de\}$, and performs the two
pruning procedures, which in this case have no effect. At this point ACP initializes
our adaptivity parameter $\alpha \in [0, 1]$. The procedure initialize($\alpha$) can
exploit all the information collected so far, such as the number of transactions,
the total number of 1-itemsets, the support threshold, the number of frequent
1-itemsets and their support, and the number of solutions at the first iteration.
For this example suppose that $\alpha$ is initialized to 0.33. Line 10 assigns to
$C_2^U$ a portion equal to $\alpha$ of $E_2$. Intuitively, since we want to find
infrequent itemsets in order to prune over the monotone border, the best third is
the 2-itemset whose subsets have the lowest support, therefore $C_2^U = \{bc\}$.
Line 11 performs the frequency test for both sets of candidates, sharing a unique
dataset scan. The count of support gives back three solutions,
$R_2 = \{be, ce, de\}$; moreover we have $L_2 = \emptyset$ and $N_2 = \{bc\}$. Then
ACP generates $P_3 = \emptyset$ (line 13) and exits the loop (line 6). At line 16 we
generate $C_3 = \{bde, cde\}$, we check their support (line 18) and obtain that
$R_3 = C_3$; finally we obtain $C_4 = \emptyset$ and we exit the second loop.
Algorithm ACP performs $5 + 4 + 2 = 11$ tests of frequency.
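The run-through can be replayed with a compact sketch of the generic strategy. Several things here are assumptions rather than the authors' choices: the dataset and prices are hypothetical stand-ins for Figure 1, $\alpha$ is kept fixed instead of being updated, the size of the $\alpha$-portion is rounded up, and the $\alpha$-selection ranks itemsets by the summed support of their items.

```python
import math
from itertools import combinations

# Hypothetical stand-in for the data of Figure 1 (an assumption).
TDB = [frozenset(t) for t in ("bde", "bde", "cde", "cde", "abc")]
PRICE = {"a": 4, "b": 4, "c": 4, "d": 4, "e": 12}
ITEMS = sorted(PRICE)
DELTA, THRESHOLD = 2, 12


def supp(x):
    return sum(1 for t in TDB if x <= t)


def c_m(x):
    return sum(PRICE[i] for i in x) >= THRESHOLD


def generate_apriori(sets, k):
    """(k+1)-itemsets all of whose k-subsets belong to `sets`."""
    return {frozenset(c) for c in combinations(ITEMS, k + 1)
            if all(frozenset(s) in sets for s in combinations(c, k))}


def generate_over(r_prev, c1, n, k):
    """generate_1, then prune_am (supersets of N), then prune_m."""
    cands = {a | i for a in r_prev for i in c1 if not i <= a}
    cands = {x for x in cands if not any(bad <= x for bad in n)}
    return {x for x in cands
            if not any(frozenset(s) not in r_prev and c_m(frozenset(s))
                       for s in combinations(sorted(x), k - 1))}


def alpha_select(e_k, alpha, item_supp):
    """Assumed heuristic: keep the alpha-portion with lowest item supports."""
    n_pick = math.ceil(alpha * len(e_k))
    ranked = sorted(e_k, key=lambda x: (sum(item_supp[i] for i in x), sorted(x)))
    return set(ranked[:n_pick])


def acp(alpha=0.33):
    tests = len(ITEMS)                      # lines 1-2: test all 1-itemsets
    item_supp = {i: supp(frozenset(i)) for i in ITEMS}
    c1 = {frozenset(i) for i in ITEMS if item_supp[i] >= DELTA}
    r_prev = {x for x in c1 if c_m(x)}      # line 3: R_1 and L_1
    solutions, n = set(r_prev), set()
    k, p_k = 2, generate_apriori(c1 - r_prev, 1)
    while p_k:                              # lines 6-15
        b_k = {x for x in p_k if c_m(x)}
        e_k = p_k - b_k
        c_over = generate_over(r_prev, c1, n, k) | b_k
        c_under = alpha_select(e_k, alpha, item_supp)
        tests += len(c_over | c_under)
        r_prev = {x for x in c_over if supp(x) >= DELTA}
        n_k = {x for x in c_under if supp(x) < DELTA}
        solutions |= r_prev
        n |= n_k
        p_k = generate_apriori(e_k - n_k, k)
        k += 1
    c_k = generate_over(r_prev, c1, n, k)   # line 16
    while c_k:                              # lines 17-21
        tests += len(c_k)
        r_k = {x for x in c_k if supp(x) >= DELTA}
        solutions |= r_k
        c_k = generate_apriori(r_k, k)
        k += 1
    return solutions, tests


print(acp(0.33)[1], acp(1.0)[1])
```

With $\alpha = 0.33$ the sketch performs the 11 frequency tests of the run-through; with $\alpha = 1$ every itemset under the border is tested and the count rises to 13, the figure reported for g&t in Example 4.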

4 Adaptivity Strategies and Optimization Issues

In the previous section we have introduced a generic strategy for adaptive
constraint pushing. This cannot really be considered an algorithm, since we have
left uninstantiated the initialize/update function for the adaptivity parameter
$\alpha$ (line 9), as well as the $\alpha$-selection (line 10). In this section we
propose a very simple adaptivity strategy for $\alpha$ and our first experimental
results. We believe that many other different adaptivity strategies can be defined
and compared.

Since we want to select itemsets which are most likely infrequent, the simplest
idea is to estimate on the fly, using all information available at the moment, a
support measure for all candidate itemsets below the monotone border. Then
the $\alpha$-selection (line 10) will simply choose among all itemsets in $E_k$ the
$\alpha$-portion with lowest estimated support.

In our first set of experiments, we have chosen to estimate the support for
an itemset using only the real support value of items belonging to the given
itemset, and balancing two extreme conditions of the correlation measure among
the items: complete independence and maximal correlation. In the former the
estimated itemset support is obtained as the product, and in the latter as the
minimum, of the relative supports of the items belonging to the itemset. Also for
the $\alpha$-adaptivity we have chosen a very simple strategy. The parameter
$\alpha$ is initialized w.r.t. the number of items which satisfy the frequency and
monotone constraints at the first iteration. Then at every new iteration it adapts
its value according to the results of the $\alpha$-selection at the previous
iteration. Let us define the $\alpha$-focus as the ratio of itemsets found
infrequent among $\alpha$-selected itemsets. An $\alpha$-focus very close to 1
(greater than 0.98) suggests that we have selected and counted too few candidates,
and thus we raise the $\alpha$ parameter for the next iteration accordingly. An
$\alpha$-focus less than 0.95 suggests that we are selecting and counting too many
candidates, and thus produces a shrink of $\alpha$.

Fig. 2. Number of candidate itemsets tested against $C_{freq}$ (dataset Connect-4,
monotone constraint threshold 3000).

These two proposed strategies, for estimating candidate support and for the
adaptivity of α, do not exploit all available information, but they allow an efficient
implementation, and they experimentally exhibit very good candidate-selection
capabilities.
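The estimation and adaptivity strategy described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the text does not say how the two extreme estimates are balanced, nor by how much α is raised or shrunk, so the `weight` and `step` parameters (and all function names) are hypothetical choices, not the authors' implementation.

```python
from math import prod


def estimate_support(itemset, rel_support, weight=0.5):
    """Estimate the relative support of an itemset from its items alone.

    `weight` is a hypothetical blending factor between the independence
    estimate (product) and the maximal-correlation estimate (minimum).
    """
    supports = [rel_support[item] for item in itemset]
    independent = prod(supports)   # items completely independent
    correlated = min(supports)     # items maximally correlated
    return weight * independent + (1.0 - weight) * correlated


def alpha_select(candidates, rel_support, alpha):
    """Return the alpha-portion of candidates with lowest estimated support."""
    ranked = sorted(candidates, key=lambda c: estimate_support(c, rel_support))
    return ranked[: int(alpha * len(ranked))]


def adapt_alpha(alpha, n_selected, n_infrequent, step=0.1):
    """Adapt alpha from the alpha-focus of the previous iteration.

    The 0.98 and 0.95 thresholds come from the text; `step` is assumed.
    """
    if n_selected == 0:
        return alpha
    focus = n_infrequent / n_selected
    if focus > 0.98:               # too few candidates selected: raise alpha
        return min(1.0, alpha + step)
    if focus < 0.95:               # too many candidates selected: shrink alpha
        return max(0.0, alpha - step)
    return alpha
```

For example, with relative supports 0.5 and 0.4 the two extreme estimates are 0.2 and 0.4, and the equal-weight blend assumed here gives 0.3.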



Experimental Results 

Since ACP balances the tradeoff between frequency and a monotone constraint,
it gives the best performance when the two components are equally strong, i.e. no 
constraint is much more selective than the other. On sparse datasets frequency 
is always very selective even at very low support levels: joining it with an equally 
strong monotone constraint would result in an empty set of solutions. Therefore, 
ACP is particularly interesting in applications involving dense datasets. 

In Figure 2, we show a comparison of the 4 strategies g&t, mcp*, ideal and
ACP, based on the portion of search space explored, i.e. the number of Cfreq tests
performed, on the well-known dense dataset connect-4¹ for different support
thresholds and monotone constraints. In order to create a monotone constraint
we have attached to each item a value v selected using a normal distribution.
Then we have chosen as monotone constraint the sum of values v in an itemset
to be greater than a given threshold.

Strategy mcp* always performs very poorly and its results could not be
reported in the graph in Figure 2. Strategy g&t explores a portion of the search
space that obviously does not depend on the monotone constraint. On this dense
dataset it performs poorly and becomes hard to compute for low supports (less
than 55%). Our simple strategy for selecting candidates under the monotone
border provides very good performance: during the first 3-4 iterations (where
it is more important not to miss infrequent itemsets) we catch all the infrequent
itemsets with α ≈ 0.2, i.e. checking only a fifth of all possible candidates.
Thanks to this capability, our ACP strategy does not lose low-cardinality
itemsets in $B^{-}(C_{freq})$ and thus approximates the ideal strategy very well, as shown
by Figure 2, performing a number of Cfreq tests one order of magnitude smaller
than strategy g&t.

¹ http://www.ics.uci.edu/~mlearn/MLRepository.html

5 Conclusions 

In this paper, we have characterized in depth the problem of computing
a conjunction of monotone and anti-monotone constraints. As a result of this
characterization, we have introduced a generic adaptive strategy, named ACP
(Adaptive Constraint Pushing), which exploits any conjunction of monotone and
anti-monotone constraints to prune the search space. We have introduced an adaptivity
parameter, called α, which can be seen as a setting knob that ranges from 0
(favoring monotone pruning) to 1 (favoring anti-monotone pruning), and that level by
level adapts itself to the input dataset and constraints, giving more power to one
pruning over the other in order to maximize efficiency. The generic algorithmic
architecture presented can be instantiated with different adaptivity strategies for
α. In this paper we have presented a very simple strategy which does not
exploit all available information, but still provides very good selection
capability and allows an efficient implementation.
Abstract. Constraint pushing techniques have been proven to be effective
in reducing the search space in the frequent pattern mining task,
and thus in improving efficiency. But while pushing anti-monotone con- 
straints in a level-wise computation of frequent itemsets has been rec- 
ognized to be always profitable, the case is different for monotone con- 
straints. In fact, monotone constraints have been considered harder to 
push in the computation and less effective in pruning the search space. 
In this paper, we show that this prejudice is ill founded and introduce 
ExAnte, a pre-processing data reduction algorithm which reduces dra- 
matically both the search space and the input dataset in constrained fre- 
quent pattern mining. Experimental results show a reduction of orders 
of magnitude, thus enabling a much easier mining task. ExAnte can be 
used as a pre-processor with any constrained pattern mining algorithm. 



1 Introduction 

Constrained itemset mining, i.e., finding all itemsets included in a transaction
database that satisfy a given set of constraints, is an active research theme in
data mining [3,6,7,8,9,10,11,12]. The most studied constraint is the frequency
constraint, whose anti-monotonicity is used to reduce the exponential search
space of the problem. Exploiting the anti-monotonicity of the frequency
constraint is known as the apriori trick [1,2]: it dramatically reduces the search space,
making the computation feasible. Frequency is not only computationally
effective, it is also semantically important, since frequency provides “support” to any
discovered knowledge. For these reasons frequency is the base constraint of what
is generally referred to as frequent itemset mining. However, many other
constraints can facilitate user-focussed exploration and control, as well as reduce
the computation. For instance, a user could be interested in mining all
frequently purchased itemsets having a total price greater than a given threshold
and containing at least two products of a given brand. Among these constraints,
classes which exhibit nice properties have been identified. The class of
anti-monotone constraints is the most effective and the easiest to use in order to prune the
search space. Since any conjunction of anti-monotone constraints is in turn
anti-monotone, we can use the apriori trick to exploit completely the pruning power
of the conjunction: the more anti-monotone constraints, the more selective the
apriori trick will be.

* The present research is funded by “Fondazione Cassa di Risparmio di Pisa” under
the “WebDigger Project”.

The dual class, monotone constraints, has been considered more complicated 
to exploit and less effective in pruning the search space. As highlighted by Bouli- 
caut and Jeudy in [3], pushing monotone constraints can lead to a reduction of 
anti-monotone pruning. Therefore, when dealing with a conjunction of mono- 
tone and anti-monotone constraints we face a tradeoff between anti-monotone 
and monotone pruning. Our observation is that the above consideration holds 
only if we focus completely on the search space of all itemsets, which is the 
approach followed by the work done so far. 

In this paper we show that the most effective way of attacking the problem
is to reason on both the itemset search space and the input transaction
database together. In this way, pushing monotone constraints does not reduce
anti-monotone pruning opportunities; on the contrary, such opportunities are
boosted. Dually, pushing anti-monotone constraints boosts monotone pruning
opportunities: the two components strengthen each other recursively. We support
this claim by introducing ExAnte, a pre-processing data reduction
algorithm which dramatically reduces both the search space and the input
dataset in constrained frequent pattern mining.

ExAnte can exploit any constraint which has a monotone component, there- 
fore also succinct monotone constraints [9] and convertible monotone constraints 
[10,11] can be used to reduce the mining computation. Being a preprocessing al- 
gorithm, ExAnte can be coupled with any constrained pattern mining algorithm, 
and it is always profitable to start any constrained patterns computation with 
an ExAnte preprocess. The correctness of ExAnte is formally proven in this pa- 
per, by showing that the reduction of items and transaction database does not 
affect the set of constrained frequent patterns, which are solutions to the given 
problem, as well as their support. We discuss a thorough experimentation of the 
algorithm, which points out how effective the reduction is, and which potential 
benefits it offers to subsequent frequent pattern computation. 



Our Contributions: 

Summarizing, the data reduction algorithm proposed in this paper is character- 
ized by the following: 

— ExAnte uses, for the first time, the real synergy of monotone and anti- 
monotone constraints to prune the search space and the input dataset: the 
total benefit is greater than the sum of the two individual benefits. 
— ExAnte can be used with any constraint which has a monotone component:
therefore also succinct monotone constraints and convertible monotone con- 
straints can be exploited. 

— ExAnte maintains the exact support of each solution itemset: a necessary 
condition if we want to compute Association Rules. 

— ExAnte can be used to make feasible the discovery of particular patterns 
which can be discovered only at very low support level, for which the com- 
putation is unfeasible for traditional algorithms. 

— Being a pre-processing algorithm, ExAnte can be coupled with any con- 
strained pattern mining algorithm, and it is always profitable to start any 
constrained pattern computation with an ExAnte preprocess. 

— ExAnte is efficient and effective: even a very large input dataset can be
reduced by an order of magnitude with little computation.

— A thorough experimental study has been performed with different monotone
constraints on various datasets (both real-world and synthetic), and
the results are described in detail.

2 Problem Definition 

Let $Items = \{x_1, \ldots, x_n\}$ be a set of distinct literals, usually called items. An
itemset $X$ is a subset of $Items$. If $|X| = k$ then $X$ is called a $k$-itemset. A
transaction is a couple $\langle tID, X \rangle$ where $tID$ is the unique transaction identifier
and $X$ is the content of the transaction (an itemset). A transaction database
$TDB$ is a finite set of transactions. An itemset $X$ is contained in a transaction
$\langle tID, Y \rangle$ if $X \subseteq Y$. Given a transaction database $TDB$, the subset of transactions
which contain an itemset $X$ is denoted $TDB[X]$. The support of an itemset $X$,
written $supp_{TDB}(X)$, is the cardinality of $TDB[X]$. Given a user-defined
minimum support $\delta$, an itemset $X$ is called frequent in $TDB$ if $supp_{TDB}(X) \geq \delta$.
This defines the frequency constraint $C_{freq}[TDB]$: if $X$ is frequent we write
$C_{freq}[TDB](X)$, or simply $C_{freq}(X)$ when the dataset is clear from the context.

Let $Th(C) = \{X \mid C(X)\}$ denote the set of all itemsets $X$ that satisfy constraint
$C$. The frequent itemset mining problem requires computing the set of all
frequent itemsets $Th(C_{freq})$. In general, given a conjunction of constraints $C$, the
constrained itemset mining problem requires computing $Th(C)$; the constrained
frequent itemset mining problem requires computing $Th(C_{freq}) \cap Th(C)$.
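The definitions above translate almost literally into code. The following brute-force sketch is meant only to make the notation concrete (it enumerates every non-empty itemset, so it is usable on toy data only); the function names `support` and `theory` and the three-transaction database are illustrative assumptions, not part of the paper.

```python
from itertools import combinations


def support(tdb, itemset):
    """supp_TDB(X): number of transactions whose content contains X."""
    target = frozenset(itemset)
    return sum(1 for _tid, content in tdb if target <= content)


def theory(items, constraint):
    """Th(C): all non-empty itemsets over `items` that satisfy `constraint`."""
    universe = sorted(items)
    return {
        frozenset(combo)
        for k in range(1, len(universe) + 1)
        for combo in combinations(universe, k)
        if constraint(frozenset(combo))
    }


# Toy transaction database: a list of (tID, content) couples.
tdb = [(1, frozenset("ab")), (2, frozenset("ac")), (3, frozenset("abc"))]
delta = 2  # minimum support

frequent = theory("abc", lambda x: support(tdb, x) >= delta)   # Th(C_freq)
at_least_two = theory("abc", lambda x: len(x) >= 2)            # Th(C), C: |X| >= 2
solutions = frequent & at_least_two                            # Th(C_freq) ∩ Th(C)
```

On this toy database the constrained frequent itemsets are {a, b} and {a, c}.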

We now formally define the notion of anti-monotone and monotone con- 
straints. 

Definition 1. Given an itemset $X$, a constraint $C_{AM}$ is anti-monotone if

  $\forall Y \subseteq X : C_{AM}(X) \Rightarrow C_{AM}(Y)$

If $C_{AM}$ holds for $X$ then it holds for any subset of $X$. □

The frequency constraint is clearly anti-monotone. This property is used by the 
APRIORI algorithm with the following heuristic: if an itemset X does not satisfy 
Cfreq, then no superset of X can satisfy Cfreq, and hence they can be pruned. 
This pruning can affect a large part of the search space, since itemsets form
a lattice. Therefore the APRIORI algorithm operates in a level-wise fashion 
moving bottom-up on the itemset lattice, and each time it finds an infrequent 
itemset it prunes away all its supersets. 
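The level-wise pruning just described can be illustrated with a compact sketch of an Apriori-style computation. This is a minimal, unoptimized rendering for illustration only (the function name and the toy data in the test are assumptions); a candidate of size k+1 is generated only if all of its k-subsets were found frequent, which is exactly where the anti-monotonicity of frequency is used.

```python
from itertools import combinations


def apriori(transactions, min_supp):
    """Return {frequent itemset: support}, computed level by level."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]   # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # Count the support of every candidate of the current level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_supp}
        frequent.update(current)
        # Build (k+1)-candidates whose k-subsets are all frequent;
        # supersets of infrequent itemsets are never generated.
        k += 1
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in current for s in combinations(union, k - 1)
            ):
                candidates.add(union)
        level = list(candidates)
    return frequent
```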

Definition 2. Given an itemset $X$, a constraint $C_M$ is monotone if:

  $\forall Y \supseteq X : C_M(X) \Rightarrow C_M(Y)$

independently from the given input transaction database. If $C_M$ holds for $X$ then
it holds for any superset of $X$. □

Note that in the last definition we have required a monotone constraint to be 
satisfied independently from the given input transaction database. This is nec- 
essary since we want to distinguish between simple monotone constraints and 
global constraints such as the “infrequency constraint”: 

  $supp_{TDB}(X) < \delta$.

This constraint is still monotone but has different properties since it is dataset 
dependent and it requires dataset scans in order to be computed. Obviously, 
since our pre-processing algorithm reduces the transaction dataset, we want to 
exclude the infrequency constraint from our study. Thus, our study focuses on 
“local” monotone constraints, in the sense that they depend exclusively on the 
properties of the itemset (as those ones in Table 1), and not on the underlying 
transaction database. 
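Definitions 1 and 2 can be checked by brute force on a small item universe, which may help in getting a feel for which "local" constraints are monotone. The sketch below is only illustrative: the price table, the thresholds and the helper names are assumptions chosen for the example. It relies on the fact that checking one-item extensions (or removals) is enough, since any superset (or subset) can be reached one item at a time.

```python
from itertools import combinations

PRICES = {"a": 5, "b": 8, "c": 14, "d": 30}  # illustrative prices


def itemsets(items):
    """Yield every non-empty itemset over `items`."""
    universe = sorted(items)
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            yield frozenset(combo)


def is_monotone(constraint, items):
    """Definition 2: if C holds for X, it holds for X plus any extra item."""
    return all(
        constraint(x | {i})
        for x in itemsets(items) if constraint(x)
        for i in items if i not in x
    )


def is_anti_monotone(constraint, items):
    """Definition 1: if C holds for X, it holds for X minus any one item."""
    return all(
        constraint(x - {i})
        for x in itemsets(items) if constraint(x) and len(x) > 1
        for i in x
    )


def total(x):
    return sum(PRICES[i] for i in x)


sum_gt_45 = lambda x: total(x) > 45           # monotone
card_le_2 = lambda x: len(x) <= 2             # anti-monotone
avg_gt_15 = lambda x: total(x) / len(x) > 15  # neither
```

With non-negative prices the sum constraint is monotone, the cardinality upper bound is anti-monotone, and the average constraint is neither.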

The general problem that we consider in this paper is the mining of itemsets 
which satisfy a conjunction of monotone and anti-monotone constraints: 

  $Th(C_{AM}) \cap Th(C_M)$.

Since any conjunction of anti-monotone constraints is an anti-monotone
constraint, and any conjunction of monotone constraints is a monotone constraint,
in this paper we focus on the problem given by the conjunction of the frequency
anti-monotone constraint ($C_{AM} \equiv supp_{TDB}(X) \geq \delta$) with various simple monotone
constraints (see Table 1):

  $Th(C_{freq}) \cap Th(C_M)$.

However, our algorithm can work with any conjunction of anti-monotone con- 
straints, provided that the frequency constraint is included in the conjunction: 
the more anti-monotone constraints, the larger the data reduction will be. 

3 Search Space and Input Data Reduction 

As already stated, if we focus only on the itemset lattice, pushing monotone
constraints can lead to a less effective anti-monotone pruning. Suppose that an
itemset has been removed from the search space because it does not satisfy some
monotone constraint $C_M$. This pruning avoids checking its support, but it
may be that, had we checked the support, the itemset would have turned out to be
infrequent, and thus all its supersets could have been pruned away. By monotone
pruning an itemset we risk losing the anti-monotone pruning opportunities given by
the itemset itself. The tradeoff is clear [3]: pushing monotone constraints can save
tests on anti-monotone constraints, however the results of these tests could have led
to more effective pruning. In order to obtain a real amalgam of the two opposite pruning
strategies we have to consider the constrained frequent patterns problem as a
whole: not focussing only on the itemset lattice but considering it together with
the input database of transactions. In fact, as proved by the theorems in the
following section, monotone constraints can prune away transactions from the
input dataset without losing solutions. This monotone pruning of transactions
has another positive effect: while reducing the number of input transactions,
it also reduces the support of items, hence the total number of frequent
1-itemsets. In other words, the monotone pruning of transactions strengthens
the anti-monotone pruning. Moreover, infrequent items can be deleted from the
computation and hence pruned away from the transactions in the input dataset.
This anti-monotone pruning has another positive effect: reducing the size of
a transaction which satisfies a monotone constraint can make the transaction
violate the monotone constraint. Therefore a growing number of transactions
which do not satisfy the monotone constraint can be found. We are clearly
inside a loop where two different kinds of pruning cooperate to reduce the
search space and the input dataset, strengthening each other step by step until
no more pruning is possible (a fix-point has been reached). This is precisely the
idea underlying ExAnte.

Table 1. Monotone constraints considered in our analysis.

  Monotone constraint    C_M ≡
  cardinality            |X| > n
  sum of prices          sum(X.prices) > n
  maximum price          max(X.prices) > n
  minimum price          min(X.prices) < n
  range of prices        range(X.prices) > n

3.1 ExAnte Properties 

In this section we formalize the basic ideas of ExAnte. First we define the two 
kinds of reduction, then we prove the completeness of the method. In the next 
section we provide the pseudo-code of the algorithm. 

Definition 3 (μ-reduction). Given a transaction database $TDB$ and a
monotone constraint $C_M$, we define the μ-reduction of $TDB$ as the dataset resulting
from pruning the transactions that do not satisfy $C_M$.

  $\mu[TDB]_{C_M} = \{\langle tID, X \rangle \mid \langle tID, X \rangle \in TDB \wedge X \in Th(C_M)\}$

□
Definition 4 (α-reduction). Given a transaction database $TDB$, a
transaction $\langle tID, X \rangle$ and a frequency constraint $C_{freq}[TDB]$, we define the α-reduction
of $\langle tID, X \rangle$ as the transaction resulting from pruning the items in $X$ that do
not satisfy $C_{freq}[TDB]$.

  $\alpha[\langle tID, X \rangle]_{C_{freq}[TDB]} = \langle tID, F_1 \cap X \rangle$

where $F_1 = \{i \in Items \mid \{i\} \in Th(C_{freq}[TDB])\}$. We define the α-reduction of
$TDB$ as the dataset resulting from the α-reduction of all transactions in $TDB$.

□
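The two reductions of Definitions 3 and 4 can be written as two small functions. This is a minimal sketch for illustration: the representation of the database as a dict from tID to item set, and the names `mu_reduce` and `alpha_reduce`, are assumptions made here.

```python
def mu_reduce(tdb, satisfies_cm):
    """mu-reduction: drop the transactions that do not satisfy C_M."""
    return {tid: x for tid, x in tdb.items() if satisfies_cm(x)}


def alpha_reduce(tdb, min_supp):
    """alpha-reduction: drop from every transaction the infrequent items."""
    counts = {}
    for x in tdb.values():
        for i in x:
            counts[i] = counts.get(i, 0) + 1
    f1 = {i for i, c in counts.items() if c >= min_supp}   # frequent items F_1
    return {tid: frozenset(x) & f1 for tid, x in tdb.items()}


tdb = {1: frozenset("ab"), 2: frozenset("ac"), 3: frozenset("abc"), 4: frozenset("c")}
reduced_by_mu = mu_reduce(tdb, lambda x: len(x) >= 2)   # transaction 4 is pruned
reduced_by_alpha = alpha_reduce(tdb, 3)                 # item b (support 2) is pruned
```

Note that the α-reduction keeps every transaction (possibly with fewer items), while the μ-reduction keeps the surviving transactions unchanged.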

The following two key theorems state that we can always μ-reduce and
α-reduce a dataset without reducing the support of solution itemsets. Moreover,
since satisfaction of $C_M$ is independent from the transaction dataset, all solution
itemsets will still satisfy it. Therefore, we can always μ-reduce and α-reduce a
dataset without losing solutions.

Theorem 5 (μ-reduction correctness). Given a transaction database $TDB$,
a monotone constraint $C_M$, and a frequency constraint $C_{freq}$, we have that:

  $\forall X \in Th(C_{freq}[TDB]) \cap Th(C_M) : supp_{TDB}(X) = supp_{\mu[TDB]_{C_M}}(X)$.

Proof. Since $X \in Th(C_M)$, all transactions containing $X$ will also satisfy $C_M$
by the monotonicity property. Therefore no transaction containing $X$ will be
μ-pruned (in other words: $TDB[X] \subseteq \mu[TDB]_{C_M}$). This, together with the
definition of support, implies the thesis. □



Theorem 6 (α-reduction correctness).
Given a transaction database $TDB$, a monotone constraint $C_M$, and a
frequency constraint $C_{freq}$, we have that:

  $\forall X \in Th(C_{freq}[TDB]) \cap Th(C_M) : supp_{TDB}(X) = supp_{\alpha[TDB]_{C_{freq}}}(X)$.

Proof. Since $X \in Th(C_{freq})$, all subsets of $X$ will be frequent (by the
anti-monotonicity of frequency). Therefore no 1-itemset in $X$ will be α-pruned (in
other words: $TDB[X] \subseteq \alpha[TDB]_{C_{freq}}$). This, together with the definition of
support, implies the thesis. □

3.2 ExAnte Algorithm 

The two theorems above suggest a fix-point computation. ExAnte starts the
first iteration as any frequent pattern mining algorithm: counting the support
of singleton items. Items that are not frequent are thrown away once and for
all. But during this first count only transactions that satisfy $C_M$ are considered.
The other transactions are marked to be pruned from the dataset (μ-reduction).
Doing so we reduce the number of interesting 1-itemsets. Even a small reduction
of this number represents a huge pruning of the search space. At this point
ExAnte deletes all infrequent items from the alive transactions (α-reduction). This
pruning can reduce the monotone value (for instance, the total sum of prices)
of some alive transactions, possibly resulting in a violation of the monotone
constraint. Therefore we have another opportunity of μ-reducing the dataset.
But μ-reducing the dataset creates new opportunities for α-reduction, which
in turn can create new opportunities for μ-reduction, and so on, until a fix-point is
reached. The pseudo-code of the ExAnte algorithm follows:



Procedure: ExAnte(TDB, C_M, min_supp)

  I = ∅;
  forall transactions t in TDB do
    if C_M(t) then forall items i in t do
      i.count++; if i.count == min_supp then I = I ∪ {i};
  old_number_interesting_items = |Items|;
  while |I| < old_number_interesting_items do
    TDB = α[TDB]_{C_freq};
    TDB = μ[TDB]_{C_M};
    old_number_interesting_items = |I|;
    I = ∅;
    forall transactions t in TDB do
      forall items i in t do
        i.count++;
        if i.count == min_supp then I = I ∪ {i};
  end while



Fig. 1. The ExAnte algorithm pseudo-code. 



Clearly, a fix-point is eventually reached after a finite number of iterations, 
as at each step the number of alive items strictly decreases. 
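The fix-point computation can be sketched in Python as follows. This is a hedged sketch, not the authors' implementation: the dict-based database, the function name `exante`, and the toy prices are assumptions, and item counts are recomputed from scratch at each iteration (the pseudo-code leaves the resetting of counters implicit).

```python
def exante(tdb, satisfies_cm, min_supp):
    """Return (reduced database, set of interesting items) at the fix-point.

    `tdb` maps tID -> iterable of items; `satisfies_cm` is the monotone
    constraint evaluated on a transaction content.
    """
    def frequent_items(db):
        counts = {}
        for x in db.values():
            for i in x:
                counts[i] = counts.get(i, 0) + 1
        return {i for i, c in counts.items() if c >= min_supp}

    alive = {tid: frozenset(x) for tid, x in tdb.items()}
    old_number = len({i for x in alive.values() for i in x})   # |Items|
    # First pass: only transactions satisfying C_M contribute to the counts.
    alive = {tid: x for tid, x in alive.items() if satisfies_cm(x)}
    interesting = frequent_items(alive)
    while len(interesting) < old_number:
        alive = {tid: x & interesting for tid, x in alive.items()}        # alpha-reduction
        alive = {tid: x for tid, x in alive.items() if satisfies_cm(x)}   # mu-reduction
        old_number = len(interesting)
        interesting = frequent_items(alive)
    return alive, interesting


# Toy usage with hypothetical prices: keep itemsets with total price >= 25.
prices = {"x": 10, "y": 20, "z": 5, "w": 1}
toy = {1: "xy", 2: "xyz", 3: "yzw", 4: "xw", 5: "xyw"}
reduced, items = exante(toy, lambda t: sum(prices[i] for i in t) >= 25, 3)
```

On the toy data, transaction 4 is pruned in the first pass, items z and w turn out to be infrequent, and transaction 3 then falls below the price threshold.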

3.3 Run through Example 

Suppose that the transaction and price datasets in Table 2 are given. Suppose
that we want to compute frequent itemsets (min_supp = 4) with a sum of prices
> 45. During the first iteration the total price of each transaction is checked
to avoid using transactions which do not satisfy the monotone constraint. All
transactions with a sum of prices > 45 are used to count the support of the
singleton items. Only the fourth transaction is discarded. At the end of the count
we find items a, e, f and h to be infrequent. Note that, if the fourth transaction
had not been discarded, items a and e would have been counted as frequent.
At this point we perform an α-reduction of the dataset: this means removing
a, e, f and h from all transactions in the dataset. After the α-reduction we have
more opportunities to μ-reduce the dataset. In fact transaction 2, which at the
beginning has a total price of 63, now has its total price reduced to 38 due to
the pruning of a and e. This transaction can now be pruned away. The same
reasoning holds for transactions number 7 and 9. At this point ExAnte counts
once again the support of the alive items on the reduced dataset. The item g, which
initially had a support of 5, has now become infrequent (see Table 2 (c) for
the item supports iteration by iteration). We can α-reduce the dataset again, and
then μ-reduce it. After the two reductions transaction number 5 no longer satisfies
the monotone constraint and it is pruned away. ExAnte counts again
the support of items on the reduced dataset, but no more items are found to
have turned infrequent. The fix-point has been reached at the third iteration: the
dataset has been reduced from 9 transactions to 4 transactions (number 1, 3, 6
and 8), and the interesting items have shrunk from 8 to 3 (b, c and d). At this
point any constrained frequent pattern mining algorithm would very easily find
the unique solution to the problem, which is the 3-itemset {b, c, d}.

Table 2. Run-through Example: price table (a) and transaction database (b), items
and their supports iteration by iteration (c).

  (a)
  item   price
  a      5
  b      8
  c      14
  d      30
  e      20
  f      15
  g      6
  h      12

  (b)
  tID   Itemset        Total price
  1     b,c,d,g        58
  2     a,b,d,e        63
  3     b,c,d,g,h      70
  4     a,e,g          31
  5     c,d,f,g        65
  6     a,b,c,d,e      77
  7     a,b,d,f,g,h    76
  8     b,c,d          52
  9     b,e,f,g        49

  (c)
         Supports
  Item   1st   2nd   3rd
  a      3     †     †
  b      7     4     4
  c      5     5     4
  d      7     5     4
  e      3     †     †
  f      3     †     †
  g      5     3     †
  h      2     †     †
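The run-through example can be replayed mechanically. The sketch below encodes the price table and the transaction database of Table 2 and yields, iteration by iteration, the alive transactions and the item supports; the generator name `exante_trace` and the data layout are assumptions made for illustration.

```python
PRICES = {"a": 5, "b": 8, "c": 14, "d": 30, "e": 20, "f": 15, "g": 6, "h": 12}
TDB = {1: "bcdg", 2: "abde", 3: "bcdgh", 4: "aeg", 5: "cdfg",
       6: "abcde", 7: "abdfgh", 8: "bcd", 9: "befg"}


def total_price(items):
    return sum(PRICES[i] for i in items)


def exante_trace(tdb, satisfies_cm, min_supp):
    """Yield (alive tIDs, item supports) for each ExAnte iteration."""
    alive = {tid: frozenset(x) for tid, x in tdb.items()}
    n_interesting = len({i for x in alive.values() for i in x})
    while True:
        # mu-reduction: keep only transactions satisfying the monotone constraint.
        alive = {tid: x for tid, x in alive.items() if satisfies_cm(x)}
        supports = {}
        for x in alive.values():
            for i in x:
                supports[i] = supports.get(i, 0) + 1
        yield sorted(alive), dict(sorted(supports.items()))
        frequent = {i for i, c in supports.items() if c >= min_supp}
        if len(frequent) == n_interesting:   # fix-point: no item turned infrequent
            return
        n_interesting = len(frequent)
        # alpha-reduction: remove infrequent items from the alive transactions.
        alive = {tid: x & frequent for tid, x in alive.items()}


trace = list(exante_trace(TDB, lambda x: total_price(x) > 45, 4))
for iteration, (tids, supports) in enumerate(trace, start=1):
    print(iteration, tids, supports)
```

The three iterations produce the three support columns of Table 2 (c), and the last one leaves transactions 1, 3, 6 and 8 with items b, c and d, each with support 4, which is also the support of {b, c, d} in the original database.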

4 Experimental Results 

In this section we describe in detail the experimental study that we have
conducted with different monotone constraints on various datasets. In particular,
the monotone constraints used in the experimentation are those in Table 1. In
addition, we have experimented with a harder-to-exploit constraint: avg(X.prices) > n.
This constraint is clearly neither monotone nor anti-monotone, but can exhibit
a monotone (or anti-monotone) behavior if items are ordered by ascending (or
descending) price, and frequent patterns are computed following a prefix-tree
approach. This class of constraints, named convertible, has been introduced in
[10]. In our experiments the constraint avg(X.prices) > n is treated by inducing
a weaker monotone constraint: max(X.prices) > n. Note that in every reported
experiment we have chosen monotone constraint thresholds that are not very
selective: there are always solutions to the given problem. In the experiments
reported in this paper we have used two datasets. “IBM” is a synthetic dataset
obtained with the most commonly adopted dataset generator, available from
IBM Almaden¹. We have generated a very large dataset since we have not been
able to find a real-world dataset with over one million transactions. “Italian” is a
real-world dataset obtained from an Italian supermarket chain within a
market-basket analysis project conducted by our research lab a few years ago (note that
the prices are in the obsolete currency Italian Lira).

¹ http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html#assocSynData

[Figure 2 (a): dataset “IBM”, cardinality constraint; x-axis: cardinality threshold;
one curve per min_supp value.]

Fig. 2. Transactions reduction (a), and interesting 1-itemsets reduction (b), on dataset
“IBM”.

Table 3. Characteristics of the datasets used in the experiments.

  Dataset   Transactions   Items     Max Trans Size   Avg Trans Size
  IBM       8,533,534      100,000   37               11.21
  Italian   186,824        4800      31               10.42

  Dataset   Min Price   Max Price   Avg Price
  Italian   100         900,000     6454.87
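The relaxation of the average-price constraint to a maximum-price constraint, used in the experiments described above, rests on the elementary fact that the maximum of a finite set of numbers is at least its mean. A one-line justification is sketched below; the symbol $p(x)$ for the price of item $x$ is notation introduced here only for this sketch.

```latex
\max_{x \in X} p(x) \;\ge\; \frac{1}{|X|} \sum_{x \in X} p(x) \;=\; avg(X.prices) \;>\; n
\quad\Longrightarrow\quad
Th\bigl(avg(X.prices) > n\bigr) \;\subseteq\; Th\bigl(max(X.prices) > n\bigr)
```

Hence no itemset satisfying the average constraint is lost when the data is reduced with the weaker maximum constraint; the original average constraint still has to be checked on the itemsets that are finally mined.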

For a more detailed report of our experiments see [5]. Figure 2 (a) shows the
reduction of the number of transactions w.r.t. the cardinality threshold
for four different support thresholds on the synthetic dataset. When the
cardinality threshold is equal to zero the number of transactions equals the total
number of transactions in the database, since there is no monotone pruning.
Already at a support threshold as low as 0.1%, with a cardinality threshold equal
to 2, the number of transactions decreases dramatically. Figure 2 (b) describes
the reduction of the number of interesting 1-itemsets on the same dataset.

Fig. 3. Search space reduction on dataset “Italian”.

As already stated, even a small reduction in the number of relevant 1-itemsets
represents a very large pruning of the search space. In our experiments, as a
measure of the search space explored, we have considered the number of candidate
itemsets generated by a level-wise algorithm such as Apriori. Figure 3
reports a comparison of the number of candidate itemsets generated by Apriori
and by ExAnteApriori (ExAnte pre-processing followed by Apriori) on the
“Italian” dataset with various constraints. The dramatic search space reduction
is evident, and it will be confirmed by the computation time reported in the next
section. Figure 3 also shows how the number of candidate itemsets shrinks as the
strength of the monotone constraint increases. This figure also highlights
another interesting feature of ExAnte: even at a very low support level (min_supp
= 5 on a dataset of 186,824 transactions) the frequent patterns computation is
feasible if coupled with a monotone constraint. Therefore, ExAnte can be used
to make feasible the discovery of particular patterns which can be discovered
only at a very low support level, for instance:

— extreme purchasing behaviors (such as patterns with a very high average of 
prices); 

— very long patterns (using the cardinality constraint coupled with a very low 
support threshold). 

We now report a run-time comparison between Apriori and ExAnteApriori. We
have chosen Apriori as the “standard” frequent pattern mining algorithm. Recall
that every frequent pattern mining algorithm can be coupled with ExAnte
pre-processing, obtaining similar benefits. Execution time is reported in Figure 4. The
large search space pruning reported in the previous section is here confirmed by
the execution time.

5 Related Work 

Being a pre-processing algorithm, ExAnte cannot be directly compared with any
previously proposed algorithm for constrained frequent pattern mining. However,
it would be interesting to couple ExAnte data reduction with those algorithms
and to measure the improvement in efficiency. Among constrained frequent pattern
mining algorithms, we would like to mention $\mathcal{FIC}^{M}$ [11] and the recently
proposed DualMiner [4].

6 Conclusions and Future Work 

In this paper we have introduced ExAnte, a pre-processing data reduction
algorithm which dramatically reduces both the search space and the input dataset, and hence
the execution time, in constrained frequent pattern mining. We have shown
experimentally the effectiveness of our method, using different constraints on
various datasets. Due to its capacity to focus on any particular instance
of the problem, ExAnte exhibits very good performance also when one of the
two constraints (the anti-monotone or the monotone) is not very selective. This
feature makes ExAnte useful for discovering particular patterns which can be
discovered only at a very low support level, for which the computation is infeasible
for traditional algorithms.

  Iteration   Transactions   1-itemsets
  0           17306          2010
  1           13167          1512
  2           11295          1205
  3           10173          1025
  4           9454           901
  5           9005           835
  6           8730           785
  7           8549           754
  8           8431           741
  9           8397           736
  10          8385           734
  11          8343           729
  12          8316           726
  13          8312           724
  14          8307           722
  15          8304           722
  Execution time: 1.5 sec

[Right panel: run-time comparison; x-axis: Support (%).]

Fig. 4. A typical execution of ExAnte: Dataset “Italian” with minsup = %40 and
sum of prices > 100000 (on the left); and a runtime comparison between Apriori and
ExAnteApriori with two different constraints (on the right).

We are currently developing a new algorithm for constrained frequent pattern
mining, which will take full advantage of ExAnte pre-processing. We are also
interested in studying in which other mining tasks ExAnte can be useful. We will
investigate its applicability to the constrained mining of closed itemsets,
sequential patterns, graph structures, and other complex kinds of data and patterns.
The ExAnte executable can be downloaded from our web site:
http://www-kdd.cnuce.cnr.it/
Abstract. Due to the potentially immense amount of frequent sets
that can be generated from transactional databases, recent studies have 
demonstrated the need for concise representations of all frequent sets. 
These studies resulted in several successful algorithms that only generate 
a lossless subset of the frequent sets. In this paper, we present a unifying 
framework encapsulating most known concise representations. Because of 
the deeper understanding of the different proposals thus obtained, we are 
able to provide new, provably more concise, representations. These theo- 
retical results are supported by several experiments showing the practical 
applicability. 



1 Introduction 

The frequent itemset mining problem is by now well known [1]. We are given a
set of items $\mathcal{I}$ and a database $\mathcal{D}$ of subsets of $\mathcal{I}$. The elements of $\mathcal{D}$ are called
transactions. An itemset $I \subseteq \mathcal{I}$ is some set of items; its support in $\mathcal{D}$, denoted
$support(I, \mathcal{D})$, is defined as the number of transactions in $\mathcal{D}$ that contain all
items of $I$. An itemset is called $s$-frequent in $\mathcal{D}$ if its support in $\mathcal{D}$ exceeds $s$. The
database $\mathcal{D}$ and the minimal support $s$ are omitted when they are clear from the
context. The goal is now, given a minimal support threshold and a database, to
find all frequent itemsets. The set of all frequent itemsets is denoted $\mathcal{F}(\mathcal{D}, s)$,
the set of infrequent sets is denoted $\overline{\mathcal{F}}(\mathcal{D}, s)$.

Recent studies on frequent itemset mining algorithms resulted in significant 
performance improvements. However, if the minimal support threshold is set too 
low, or the data is highly correlated, the number of frequent itemsets itself can 
be extremely large. To overcome this problem, recently several proposals have 
been made to construct a concise representation [13] of the frequent itemsets, 
instead of mining all frequent itemsets: Closed sets [2,4,14,15,16], Free sets [5], 
Disjunction-Free Sets [6,10], Generalized Disjunction-Free Generators [12,11], 
and Non-Derivable Itemsets [8]. 

A Concise Representation of frequent sets is a subset of all frequent sets with 
their supports that contains enough information to construct all frequent sets 
with their support. Therefore, based on the representation, for each itemset $I$, 
we must be able to (a) decide whether $I$ is frequent, and (b) if $I$ is frequent, 
produce its support. 

Mannila et al. [13] introduced the notion of a concise representation in a more 
general context. Our definition resembles theirs, but for reasons of simplicity we 
only concentrate on representations that are exact, and for frequent itemsets. 
For representations the term concise will refer to their space-efficiency; that 
is, $\mathcal{R}$ is called more concise than $\mathcal{R}'$ if for every database 
$\mathcal{D}$ and support threshold $s$, $\mathcal{R}(\mathcal{D}, s)$ is smaller 
than or equal to $\mathcal{R}'(\mathcal{D}, s)$. 

We introduce new representations based on the deduction rules for support 
presented in [8]. Many of the proposals in the literature, such as the free sets [5], 
the disjunction-free sets [6,10], the generalized disjunction-free sets [12,11], the 
disjunction-free generators [10], the generalized disjunction-free generators [11,12], 
and the non-derivable itemsets [8] representations, will be shown to be 
manifestations of this method. As such, the proposed method serves as a unifying 
framework for these representations. 

The organization of the paper is as follows. In Section 2 we briefly describe 
different concise representations in the literature. Section 3 revisits the deduction 
rules introduced in [8]. In Section 4, a unifying framework for different concise 
representations is given, based on the deduction rules. Also new, minimal, rep- 
resentations are introduced. In Section 5 we present the results of experiments 
concerning the size of the different representations. 



2 Related Work 

Closed Sets. The first successful concise representation was the closed set 
representation introduced by Pasquier et al. [14]. In short, a closed set is an itemset 
such that its frequency does not equal the frequency of any of its proper supersets. 
The collection of the frequent closed sets together with their supports is a concise 
representation. This representation will be denoted ClosedRep. 
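
A minimal sketch of the closedness test follows. The helper names and the toy database are hypothetical. Because support can only decrease when items are added, it suffices to compare against the one-item extensions of the set.

```python
def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for transaction in database if items <= transaction)

def is_closed(itemset, database):
    """True if no proper superset has the same support as `itemset`."""
    items = frozenset(itemset)
    universe = set().union(*database)
    base = support(items, database)
    # Checking single-item extensions is enough, since support is anti-monotone.
    return all(support(items | {extra}, database) < base
               for extra in universe - items)

# Hypothetical toy database, not taken from the paper.
closed_demo_db = [frozenset(t) for t in ("ab", "abc", "ac", "b")]
```

In this toy database the set `c` is not closed, because every transaction containing `c` also contains `a`, so `ac` has the same support.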

Generalized Disjunction-Free Sets. [11,12] Let $X, Y$ be two disjoint itemsets. 
The disjunctive rule $X \rightarrow \bigvee Y$ is said to hold in the database 
$\mathcal{D}$ if every transaction in $\mathcal{D}$ that contains $X$ also contains 
at least one item of $Y$. A set $I$ is called generalized disjunction-free if there 
do not exist disjoint subsets $X, Y$ of $I$ such that $X \rightarrow \bigvee Y$ holds. 
The set of all generalized disjunction-free sets is denoted GDFree. 

In [12], a representation based on the frequent generalized disjunction-free 
sets is introduced. On the one hand, based on the supports of all subsets of a 
set $I$ (including $I$), it can be decided whether $I$ is generalized disjunction-free 
or not. On the other hand, if a disjunctive rule $X \rightarrow \bigvee Y$ holds, the 
support of every superset $I$ of $X \cup Y$ can be constructed from the supports of 
its subsets. For example, $a \rightarrow b \vee c$ holds if and only if for every 
superset $X$ of $abc$, 

$$\mathit{supp}(X) = \mathit{supp}(X - b) + \mathit{supp}(X - c) - \mathit{supp}(X - bc) .$$

Hence, if we know that a rule $X \rightarrow \bigvee Y$ holds, there is no need to 
store supersets of $X \cup Y$ in the representation. 
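
As a hedged illustration of the rule and the support identity above, the following sketch checks a disjunctive rule on a small database and verifies the identity for the set abc. The function names and the database are hypothetical.

```python
def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for transaction in database if items <= transaction)

def rule_holds(x, y, database):
    """True if every transaction containing all of X contains at least one item of Y."""
    x, y = frozenset(x), frozenset(y)
    return all(transaction & y for transaction in database if x <= transaction)

# Hypothetical toy database in which the rule a -> b v c holds.
rule_demo_db = [frozenset(t) for t in ("ab", "abc", "ac", "b")]

# Support of abc derived from its subsets, using the rule a -> b v c.
derived_abc = (support("ac", rule_demo_db) + support("ab", rule_demo_db)
               - support("a", rule_demo_db))
```

Here `derived_abc` is obtained without counting abc in the database, which is the reason supersets of a set with a valid rule need not be stored.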

However, the set of frequent generalized disjunction-free sets FGDFree is not 
a representation. We illustrate this with an example. Suppose that FGDFree 
completed with the supports is $\{(\emptyset, 10), (a, 5), (b, 4), (c, 3), (ab, 3)\}$. 
What conclusion should be drawn for the set $ac$? There can be two reasons for $ac$ 
to be left out of the representation: (a) because $ac$ is infrequent, or (b) because 
$ac$ is not generalized disjunction-free. Furthermore, suppose that $ac$ was left out 
because it is not generalized disjunction-free. Since we have no clue which 
disjunctive rule holds for $ac$, we cannot produce its support. Hence, FGDFree 
completed with the supports of the sets clearly is not a representation. This problem 
is resolved in [12] by adding a part of the border of the set FGDFree to the 
representation. 

Definition 1. Let $\mathcal{S}$ be a set of itemsets. $\mathcal{B}(\mathcal{S}) = \{J \mid J \notin \mathcal{S}, \forall J' \subset J : J' \in \mathcal{S}\}$. 
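
The border of Definition 1 can be sketched as follows, assuming a small, explicitly given universe of items. The function name is hypothetical, and the exhaustive subset check is only meant to mirror the definition, not to be efficient.

```python
from itertools import combinations

def border(collection, universe):
    """Sets not in `collection` whose strict subsets all belong to `collection`."""
    members = {frozenset(s) for s in collection}
    items = sorted(universe)
    result = set()
    for size in range(len(items) + 1):
        for candidate in combinations(items, size):
            if frozenset(candidate) in members:
                continue
            # Follow the definition literally: every strict subset must be a member.
            if all(frozenset(sub) in members
                   for k in range(size)
                   for sub in combinations(candidate, k)):
                result.add(frozenset(candidate))
    return result

# The collection of the FGDFree example: the empty set, a, b, c, and ab.
example_border = border(["", "a", "b", "c", "ab"], "abc")
```

For the example collection the border consists of ac and bc, which are exactly the sets whose absence could not be explained without extra information.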

Suppose that we also store the sets in $\mathcal{B}(FGDFree)$ in the representation. 
Let $I$ be a set not in $FGDFree \cup \mathcal{B}(FGDFree)$. There exists a set 
$J \subset I$ in $\mathcal{B}(FGDFree)$. The set $J$ is either infrequent, or not 
generalized disjunction-free. If $J$ is infrequent, then $I$ is as well. If $J$ is not 
generalized disjunction-free, then the supports of all subsets of $J$ (including the 
support of $J$) allow for determining the rule $X \rightarrow \bigvee Y$ that holds 
for $J$. Hence, we know a rule $X \rightarrow \bigvee Y$ that holds for $I$ 
($X, Y \subseteq J \subset I$). Therefore, from the supports of all strict subsets of 
$I$, we can derive the support of $I$ using this rule. Using induction on the 
cardinality of $I$, it can easily be proven that $FGDFree \cup \mathcal{B}(FGDFree)$ 
completed with the supports is a representation. For the details, we refer to [11,12]. 

It is also remarked in [12] that it is not necessary to store the complete 
border $\mathcal{B}(FGDFree)$. For example, we could decide to leave out the infrequent 
sets. When reconstructing the complete set of frequent itemsets, we will be able 
to recognize these infrequent sets in the border because they are the only sets that 
have all their strict subsets in FGDFree, but that are not in the representation 
themselves. Other alternatives are the generalized disjunction-free generators 
representation (GDFreeGenRep) [12] and the representations in Section 4. 

Free and Disjunction-Free Sets. [5,6,10] Free and disjunction-free sets are special 
cases of generalized disjunction-free sets. For free sets, the right-hand side of the 
rules $X \rightarrow \bigvee Y$ is restricted to singletons, for disjunction-free sets 
to singletons and pairs. Hence, a set $I$ is free if and only if there does not exist 
a rule $X \rightarrow a$ that holds with $X \cup \{a\} \subseteq I$, and $I$ is 
disjunction-free if there does not exist a rule $X \rightarrow a \vee b$ that holds 
with $X \cup \{a, b\} \subseteq I$. The free and disjunction-free sets are denoted 
respectively by Free and DFree, the frequent free and frequent disjunction-free sets 
by FFree and FDFree. 

Again, neither FFree nor FDFree completed with the supports forms a concise 
representation. The reasons are the same as explained for the generalized 
disjunction-free sets above. Hence, for the representations based on the free sets 
and the disjunction-free sets, (parts of) the border must be stored as well. Which 
parts of the border are stored can have a significant influence on the size of the 
representations, since the border is often very large, sometimes even larger than 
the total number of frequent itemsets. 

However, the parts of the border that are stored in the representations 
presented in [5,6,10,11,12] are often far from optimal. In this paper we describe a 
unifying framework for these disjunctive-rule based representations. This 
framework is based on the deduction rules for support presented in [8] and revisited 
in Section 3. The framework allows a neat description of the different strategies 
used in the free, disjunction-free and generalized disjunction-free based 
representations. Due to the deeper understanding of the problem resulting from the 
unifying framework, we are able to find new and more concise representations 
that drastically reduce the number of sets to be stored. 

3 Deduction Rules 

In this section we review the deduction rules introduced in [8] . These rules derive 
bounds on the support of an itemset I if the supports of all strict subsets of I 
are known. In [7], it is shown that these rules are sound and complete; that is, 
they compute the best possible bounds. 

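Let a generalized itemset be a conjunction of items and negations of items. 
For example, $G = \{a, b, \overline{c}, \overline{d}\}$ is a generalized itemset. We 
write $G = X \cup \overline{Y}$, where $\overline{Y}$ denotes the negations of the 
items in $Y$. A transaction $T$ contains a generalized itemset $G = X \cup \overline{Y}$ 
if $X \subseteq T$ and $T \cap Y = \emptyset$. The support of a generalized 
itemset $G$ in a database $\mathcal{D}$ is the number of transactions of $\mathcal{D}$ 
that contain $G$. 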

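We say that a generalized itemset $G = X \cup \overline{Y}$ is based on itemset $I$ 
if $I = X \cup Y$. From the well-known inclusion-exclusion principle [9], we know 
that for a given generalized itemset $G = X \cup \overline{Y}$ based on $I$, 

$$\mathit{supp}(G) = \sum_{X \subseteq J \subseteq I} (-1)^{|J \setminus X|} \, \mathit{supp}(J) .$$

Since $\mathit{supp}(G)$ is always larger than or equal to 0, we derive 

$$\sum_{X \subseteq J \subseteq I} (-1)^{|J \setminus X|} \, \mathit{supp}(J) \ge 0 .$$

If we isolate $\mathit{supp}(I)$ in this inequality, we obtain the following bound on 
the support of $I$: 

$$\mathit{supp}(I) \le \sum_{X \subseteq J \subset I} (-1)^{|I \setminus J| + 1} \, \mathit{supp}(J) \quad \text{if } |I \setminus X| \text{ odd,}$$

$$\mathit{supp}(I) \ge \sum_{X \subseteq J \subset I} (-1)^{|I \setminus J| + 1} \, \mathit{supp}(J) \quad \text{if } |I \setminus X| \text{ even.}$$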

This rule will be denoted $\mathcal{R}_I(X)$. Depending on the sign of the coefficient 
of $\mathit{supp}(I)$, the bound is a lower or an upper bound. If $|I \setminus X|$ is 
odd, $\mathcal{R}_I(X)$ is an upper bound, otherwise it is a lower bound. Thus, given 
the supports of all strict subsets of an itemset $I$, we can derive lower and upper 
bounds on the support of $I$ with the rules $\mathcal{R}_I(X)$ for all 
$G = X \cup \overline{Y}$ based on $I$. 

We denote the greatest lower bound on $I$ by $LB(I)$ and the least upper bound 
by $UB(I)$. The complexity of the rules $\mathcal{R}_I(X)$ increases exponentially 
with the cardinality of $I \setminus X$. The number $|I \setminus X|$ is called the 
depth of rule $\mathcal{R}_I(X)$. Since calculating all rules is often tedious, we 
sometimes restrict ourselves to only rules of limited depth. More specifically, we 
denote the greatest lower and least upper bounds on the support of $I$ resulting from 
evaluation of rules up to depth $k$ by $LB_k(I)$ and $UB_k(I)$. Hence, the interval 
$[LB_k(I), UB_k(I)]$ gives the bounds calculated by the rules 
$\{\mathcal{R}_I(X) \mid X \subseteq I, |I \setminus X| \le k\}$. 
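Example 1. Consider the following database: 

TID   Items 
1     a 
2     b 
3     c 
4     a, b 
5     a, c 
6     b, c 
7     a, b, c 

With $s_J$ denoting $\mathit{supp}(J)$, the rules for $abc$ are: 

$$\begin{array}{lcl}
\mathit{supp}(abc) & \ge & 0 \\
\mathit{supp}(abc) & \le & s_{ab} = 2 \\
\mathit{supp}(abc) & \le & s_{ac} = 2 \\
\mathit{supp}(abc) & \le & s_{bc} = 2 \\
\mathit{supp}(abc) & \ge & s_{ab} + s_{ac} - s_{a} = 0 \\
\mathit{supp}(abc) & \ge & s_{ab} + s_{bc} - s_{b} = 0 \\
\mathit{supp}(abc) & \ge & s_{ac} + s_{bc} - s_{c} = 0 \\
\mathit{supp}(abc) & \le & s_{ab} + s_{ac} + s_{bc} - s_{a} - s_{b} - s_{c} + s_{\emptyset} = 1
\end{array}$$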

The rules above are the rules $\mathcal{R}_{abc}(X)$ for $X$ respectively 
$abc, ab, ac, bc, a, b, c, \emptyset$. The first rule has depth 0, the following three 
rules depth 1, the next three rules depth 2, and the last rule has depth 3. Hence, 
$LB_0(abc) = 0$, $LB_2(abc) = 0$, $UB_1(abc) = 2$, $UB_3(abc) = 1$. □ 
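
The bounds of Example 1 can be recomputed with a short sketch of the deduction rules. The function names are hypothetical, and the code evaluates every rule directly from the database rather than from stored supports, so it is an illustration of the rules and not of an efficient algorithm.

```python
from itertools import combinations

def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for transaction in database if items <= transaction)

def subsets(items):
    """All subsets of `items` as frozensets."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def rule_bound(itemset, x, database):
    """Bound of rule R_I(X): sum over X <= J < I of (-1)^(|I \\ J| + 1) * supp(J)."""
    i, x = frozenset(itemset), frozenset(x)
    total = 0
    for extra in subsets(i - x):
        j = x | extra
        if j != i:  # only strict subsets of I contribute
            total += (-1) ** (len(i - j) + 1) * support(j, database)
    return total

def bounds(itemset, database, max_depth=None):
    """(LB_k, UB_k) using rules up to depth k; UB_k is None when no upper-bound rule applies."""
    i = frozenset(itemset)
    depth_limit = len(i) if max_depth is None else max_depth
    lower, upper = [], []
    for x in subsets(i):
        depth = len(i - x)
        if depth <= depth_limit:
            # Odd depth gives an upper bound, even depth a lower bound.
            (upper if depth % 2 else lower).append(rule_bound(i, x, database))
    return max(lower), (min(upper) if upper else None)

# The database of Example 1.
example_db = [frozenset(t) for t in ("a", "b", "c", "ab", "ac", "bc", "abc")]
```

Calling `bounds("abc", example_db, k)` for k = 1, 2, 3 reproduces the values stated in the example.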



Links Between $\mathcal{R}_I(X)$, the support of $X \cup \overline{Y}$, and 
$X \rightarrow \bigvee Y$. Let $I$ be an itemset, and $G = X \cup \overline{Y}$ a 
generalized itemset based on $I$. From the derivation of the rule $\mathcal{R}_I(X)$, 
it can be seen that the difference between the bound calculated by it, and the actual 
support of $I$ equals the support of $X \cup \overline{Y}$. Hence, the bound 
calculated by $\mathcal{R}_I(X)$ equals $\mathit{supp}(I)$ if and only if 
$\mathit{supp}(X \cup \overline{Y}) = 0$. It is also true that the disjunctive rule 
$X \rightarrow \bigvee Y$ holds if and only if $\mathit{supp}(X \cup \overline{Y}) = 0$. 
Indeed, if $\mathit{supp}(X \cup \overline{Y})$ is 0, then there are no transactions 
that contain $X$ but do not contain any of the items in $Y$. Therefore, we obtain the 
following theorem. 

Theorem 1. Let $I$ be an itemset, and $G = X \cup \overline{Y}$ a generalized itemset 
based on $I$. The following are equivalent: 

(a) The bound calculated by $\mathcal{R}_I(X)$ equals the support of $I$, 

(b) $\mathit{supp}(G) = 0$, and 

(c) The disjunctive rule $X \rightarrow \bigvee Y$ holds. □ 



Example 2. We continue Example 1. Since the bound 1 calculated by 
$\mathcal{R}_{abc}(\emptyset)$ equals $\mathit{supp}(abc)$, 
$\mathit{supp}(\overline{a}\,\overline{b}\,\overline{c})$ must be 0. Indeed, there is 
no transaction that contains none of $a$, $b$, or $c$. Hence, the disjunctive rule 
$\emptyset \rightarrow a \vee b \vee c$ holds. On the other hand, the difference 
between the bound calculated by $\mathcal{R}_{abc}(a)$ and the actual support of $abc$ 
is 1. Hence, $\mathit{supp}(a\,\overline{b}\,\overline{c}) = 1$. □ 
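
The two supports mentioned in Example 2 can be checked directly, together with the inclusion-exclusion sum they are derived from. This is a small sketch with hypothetical function names, using the database of Example 1.

```python
from itertools import combinations

def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for transaction in database if items <= transaction)

def generalized_support(positive, negative, database):
    """Transactions containing all `positive` items and none of the `negative` items."""
    pos, neg = frozenset(positive), frozenset(negative)
    return sum(1 for t in database if pos <= t and not (t & neg))

def inclusion_exclusion(x, itemset, database):
    """Sum over X <= J <= I of (-1)^(|J \\ X|) * supp(J)."""
    x, i = frozenset(x), frozenset(itemset)
    rest = sorted(i - x)
    total = 0
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            total += (-1) ** size * support(x | set(extra), database)
    return total

# The database of Example 1.
example2_db = [frozenset(t) for t in ("a", "b", "c", "ab", "ac", "bc", "abc")]
```

Both functions should agree for every choice of X inside I; the test values below are the two cases discussed in Example 2.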

4 Unifying Framework 

In [8] we introduced the NDI representation based on the deduction rules which 
we repeated in Section 3. The NDI-representation was defined as follows: 

$$NDIRep(\mathcal{D}, s) =_{def} \{(I, \mathit{supp}(I, \mathcal{D})) \mid \mathit{supp}(I, \mathcal{D}) \ge s, \; LB(I) \ne UB(I)\}$$

Hence, if a set $I$ is not in the representation, then either $LB(I) = UB(I)$, and 
hence the support of $I$ is determined uniquely by the deduction rules, or $I$ is 
infrequent. A set $I$ with $LB(I) = UB(I)$ is called a derivable itemset (DI), 
otherwise it is called a non-derivable itemset (NDI). Derivability is monotone: every 
superset of a derivable itemset is derivable, which allows an Apriori-like 
algorithm [8]. 

NDIRep is the only representation that is based on logical implication. For 
every set I not in the representation, I is either infrequent in every database 
consistent with the supports in NDIRep, or every such database gives the same 
support to I. All other representations are based on additional assumptions. For 
example, in the disjunction-free generators representation there is an explicit 
assumption that all sets in the border of FGDFree that are not in the represen- 
tation, are not free. Such assumptions make it possible to reduce the size of the 
representations. 

In this section, we add similar assumptions to the NDI-based representations. 
In order to do this, we identify different groups of itemsets: itemsets that are 
frequent versus those that are infrequent, sets that have support equal to the 
lower bound, equal to the upper bound, etc. Based on these groups a similar 
strategy as for the free, the disjunction-free, and the generalized disjunction-free 
representations will be followed. We identify minimal sets of groups that need 
to be stored in order to obtain a representation. 

4.1 k-Free Sets 

The $k$-free sets will be a key tool in the unified framework. 

Definition 2. 

A set $I$ is said to be $k$-free, if $\mathit{supp}(I) \ne LB_k(I)$ and $\mathit{supp}(I) \ne UB_k(I)$. 
A set $I$ is said to be $\infty$-free, if $\mathit{supp}(I) \ne LB(I)$ and $\mathit{supp}(I) \ne UB(I)$. 

The set of all $k$-free ($\infty$-free) sets is denoted $Free_k$ ($Free_\infty$). □ 

As the next lemma states, these definitions cover freeness, disjunction-freeness, 
and generalized disjunction-freeness. The proof is based on Theorem 1, but is 
omitted because of space restrictions. 

Lemma 1. Let $I$ be an itemset. 

— $I$ is free if and only if $I$ is 1-free. 

— $I$ is disjunction-free if and only if $I$ is 2-free. 

— $I$ is generalized disjunction-free if and only if $I$ is $\infty$-free. 

$k$-freeness is anti-monotone; if a set $I$ is $k$-free, then all its subsets are 
$k$-free as well. Moreover, if $\mathit{supp}(J) = LB_k(J)$ ($\mathit{supp}(J) = UB_k(J)$), 
then also $\mathit{supp}(I) = LB_k(I)$ ($\mathit{supp}(I) = UB_k(I)$), for all $J \subseteq I$. 
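
A minimal sketch of the $k$-freeness test follows, applied to the database of Example 1. The names are hypothetical. The sketch assumes that a missing upper bound (no odd-depth rule within the depth limit) never equals the support, and it recomputes supports from the database for simplicity.

```python
import math
from itertools import combinations

def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for transaction in database if items <= transaction)

def bounds(itemset, database, max_depth=math.inf):
    """(LB_k, UB_k) from all rules of depth <= max_depth; UB_k is None if no such rule exists."""
    i = frozenset(itemset)
    members = sorted(i)
    lower, upper = [0], []  # the depth-0 rule always yields the lower bound 0
    for depth in range(1, min(max_depth, len(members)) + 1):
        for removed in combinations(members, depth):
            x = i - set(removed)  # rule R_I(X) with I \ X = removed
            bound = 0
            for size in range(depth):  # strict subsets J of I that contain X
                for extra in combinations(removed, size):
                    sign = (-1) ** (depth - size + 1)  # |I \ J| = depth - size
                    bound += sign * support(x | set(extra), database)
            (upper if depth % 2 else lower).append(bound)
    return max(lower), (min(upper) if upper else None)

def is_k_free(itemset, database, k=math.inf):
    """True if the support equals neither LB_k nor UB_k."""
    supp = support(itemset, database)
    lb, ub = bounds(itemset, database, k)
    return supp != lb and (ub is None or supp != ub)

# The database of Example 1.
kfree_db = [frozenset(t) for t in ("a", "b", "c", "ab", "ac", "bc", "abc")]
```

In this database abc is 1-free and 2-free but not 3-free, in line with Example 2, where a disjunctive rule with three items on the right-hand side holds.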

4.2 Groups in the Border 

Let now $FFree_k$ be the frequent $k$-free sets. As we argued in Section 2 for the 
generalized disjunction-free representations, $FFree_k$ is not a representation. Indeed, 
if a set $I$ is not in the representation, there is no way to know whether $I$ was left 
out of the representation because $I$ is infrequent, or because $\mathit{supp}(I) = LB_k(I)$, 
or because $\mathit{supp}(I) = UB_k(I)$. To resolve this problem, parts of the border 
$\mathcal{B}(FFree_k)$ have to be stored as well. If we can restore the border exactly, 
then also the other frequent sets can be determined. This can be seen as follows: if 
a set $I$ is not in $\mathcal{B}(FFree_k)$, and not in $FFree_k$, then it has a subset 
$J$ in the border. If this set $J$ is infrequent, then so is $I$. If 
$\mathit{supp}(J) = LB_k(J)$, then also $\mathit{supp}(I) = LB_k(I)$, and, if 
$\mathit{supp}(J) = UB_k(J)$, then also $\mathit{supp}(I) = UB_k(I)$ 
(Lemma 1). Hence, if we can restore the complete border, then we can restore 
all necessary information. 

[Figure 1: classification tree for a set $I$ in $\mathcal{B}(FFree_k)$; boxed groups are marked with *. 
  Frequent: 
    supp(I) = LB_k(I) = UB_k(I):      $flu$ 
    supp(I) = LB_k(I) ≠ UB_k(I):      $fl\bar{u}$ * 
    supp(I) = UB_k(I) ≠ LB_k(I):      $f\bar{l}u$, split on the lower bound into $cf\bar{l}u$ * and $uf\bar{l}u$ * 
  Infrequent: 
    supp(I) = LB_k(I) = UB_k(I):      $ilu$ 
    supp(I) = LB_k(I) ≠ UB_k(I):      $il\bar{u}$, split on the upper bound into $cil\bar{u}$ and $uil\bar{u}$ * 
    supp(I) = UB_k(I) ≠ LB_k(I):      $i\bar{l}u$ 
    supp(I) ≠ LB_k(I), ≠ UB_k(I):     $i\bar{l}\bar{u}$, split on the upper bound into $ci\bar{l}\bar{u}$ and $ui\bar{l}\bar{u}$ *] 

Fig. 1. This tree classifies every set in $\mathcal{B}(FFree_k)$ in the right group. Only 
the groups that are in a rectangle need to be stored in a representation. 

The sets in $\mathcal{B}(FFree_k)$ can be divided in different groups, depending on 
whether they are frequent or not, have frequency equal to the lower bound or 
not, and have frequency equal to the upper bound or not. In order to make the 
discussion easier, we introduce a 3-letter notation to denote the different groups 
in the border. The first letter denotes whether the sets in the group are frequent: 
$f$ is frequent, $i$ is infrequent. The second letter is $l$ if the sets $I$ in the 
group have $\mathit{supp}(I) = LB_k(I)$, otherwise it is $\bar{l}$. The third letter is 
$u$ for groups with $\mathit{supp}(I) = UB_k(I)$, and $\bar{u}$ otherwise. The rule 
depth $k$ is indicated as a subscript to the notation. For example, $f\bar{l}u_k$ 
denotes the group 

$$f\bar{l}u_k =_{def} \mathcal{B}(FFree_k) \cap \mathcal{F} \cap \{I \mid \mathit{supp}(I) \ne LB_k(I)\} \cap \{I \mid \mathit{supp}(I) = UB_k(I)\} ,$$

and $il\bar{u}_k$ denotes the group 

$$il\bar{u}_k =_{def} \mathcal{B}(FFree_k) \cap \overline{\mathcal{F}} \cap \{I \mid \mathit{supp}(I) = LB_k(I)\} \cap \{I \mid \mathit{supp}(I) \ne UB_k(I)\} .$$

We split some of the groups even further, based on whether or not the bounds 
$LB_k(I)$ and $UB_k(I)$ allow to conclude that a set is certainly frequent or certainly 
infrequent. For example, in the group $f\bar{l}u$, we distinguish between sets $I$ such 
that the bounds allow to derive that $I$ is frequent, and the other sets. That is, 
$cf\bar{l}u$ ($c$ of certain) is the set 

$$cf\bar{l}u_k =_{def} \mathcal{B}(FFree_k) \cap \mathcal{F} \cap \{I \mid \mathit{supp}(I) \ne LB_k(I)\} \cap \{I \mid \mathit{supp}(I) = UB_k(I)\} \cap \{I \mid LB_k(I) \ge s\} .$$

The other sets of $f\bar{l}u$ are in $uf\bar{l}u$ ($u$ of uncertain). Thus, $uf\bar{l}u$ is the set 

$$uf\bar{l}u_k =_{def} \mathcal{B}(FFree_k) \cap \mathcal{F} \cap \{I \mid \mathit{supp}(I) \ne LB_k(I)\} \cap \{I \mid \mathit{supp}(I) = UB_k(I)\} \cap \{I \mid LB_k(I) < s\} .$$

Some of the groups only contain certain or uncertain sets, such as $fl\bar{u}$. Since 
$fl\bar{u}$ only contains frequent sets $I$ with $\mathit{supp}(I) = LB_k(I)$, automatically 
the condition $LB_k(I) \ge s$ is fulfilled. The different groups are depicted in Figure 1. 

The tree in this figure indicates to which group a set $I \in \mathcal{B}(FFree_k)$ belongs. 
For example, a frequent set with $\mathit{supp}(I) = LB_k(I)$ and $\mathit{supp}(I) \ne UB_k(I)$ 
takes the upper branch at the first split, since it is frequent, and the second 
branch in the second split. Notice that there are no groups with code $f\bar{l}\bar{u}$, 
because sets that are frequent and have a frequency that equals neither the lower, nor 
the upper bound, must be in $FFree_k$ and hence cannot be in $\mathcal{B}(FFree_k)$. To 
make notations more concise, we will sometimes leave out some of the letters. For 
example, $fl_k$ denotes the union $flu_k \cup fl\bar{u}_k$, and $il_k$ denotes 
$ilu_k \cup cil\bar{u}_k \cup uil\bar{u}_k$. 
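
The classification into groups can be sketched as a small function of the support, the two bounds, and the threshold. This is only an illustration under stated assumptions: the function name is hypothetical, a `~` before a letter stands for the overbar, "frequent" is taken to mean support greater than or equal to the threshold, and the certain/uncertain split of the infrequent groups is assumed to test whether the upper bound already lies below the threshold.

```python
def classify(supp, lb, ub, min_support):
    """Group code of a border set, e.g. 'uf~lu'; '~' before a letter stands for the overbar."""
    frequent = supp >= min_support
    at_lower, at_upper = supp == lb, supp == ub
    code = (("f" if frequent else "i")
            + ("l" if at_lower else "~l")
            + ("u" if at_upper else "~u"))
    if frequent and not at_lower and not at_upper:
        # Such a set is frequent and k-free, so it belongs to FFree_k, not to its border.
        raise ValueError("a frequent set off both bounds is not in the border")
    if code == "f~lu":
        # Certain if the lower bound alone already proves frequency.
        return ("c" if lb >= min_support else "u") + code
    if code in ("il~u", "i~l~u"):
        # Certain if the upper bound alone already proves infrequency.
        return ("c" if ub < min_support else "u") + code
    return code
```

For instance, the set abc of Example 1 (support 1, bounds 0 and 1 at depth 3) would fall in the uncertain group `uf~lu` for a threshold of 1.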

4.3 Representations Expressed with $FFree_k$ and the Groups 

We can express many of the existing representations in terms of $FFree_k$ for a 
certain $k$, and a list of groups in the border of $FFree_k$. Table 1 describes different 
existing representations in this way. The correctness of this table is proven in [7]. 
The first line of the table, for example, states that the free sets representation 
actually is 

$$(\{(I, \mathit{supp}(I)) \mid I \in FFree_1\}, \; fl\bar{u}_1, \; cil\bar{u}_1, \; uil\bar{u}_1, \; ci\bar{l}\bar{u}_1, \; ui\bar{l}\bar{u}_1) .$$

We do not differentiate between a representation that stores the different groups 
separately, or in one set; that is, storing the union of two groups as one set is 
considered the same as storing the two groups as a pair of sets. The reason for this 
is that for space usage the difference between the two is not significant. 

The notation $f\bar{u}_{\infty,1}$ and $i\bar{u}_{\infty,1}$ for the generalized 
disjunction-free generators representation indicates that in this representation, 
$FFree_\infty$ is used as basis, but for pruning the border $\mathcal{B}(FFree_\infty)$, 
only rules up to depth 1 are used. In the experiments however, we will use the other 
rules for pruning the border as well, and hence we report a slightly better size for 
this representation. 

4.4 Minimal Representations 

We cannot distinguish between two itemsets within the same group if we only 
use comparisons between their lower and upper bound, their support, and the 
minimal support threshold. Hence, we can think of the different groups as being 
equivalence classes. We will now concentrate on which of these classes have to 
be stored to get a minimal representation. 

Table 1. Representations in terms of $FFree_k$ and the groups in $\mathcal{B}(FFree_k)$. 
DFreeGenRep denotes the disjunction-free generators representation, GDFreeGenRep 
the generalized disjunction-free generators representation. 

Representation   Base             Border groups (with frequency; without frequency) 
FreeRep          $FFree_1$        $\bar{u}_1$ 
DFreeRep         $FFree_2$        complete border 
DFreeGenRep      $FFree_2$        $fl\bar{u}_2$; $i\bar{u}_2$ 
GDFreeRep        $FFree_\infty$   complete border 
GDFreeGenRep     $FFree_\infty$   $f\bar{u}_{\infty,1}$; $i\bar{u}_{\infty,1}$ 
NDIRep           $FFree_\infty$   $fl\bar{u}_\infty$, $f\bar{l}u_\infty$ (with frequency) 

Instead of storing the complete border in a representation, we can restrict 
ourselves to only some of the groups. It is, for example, not necessary to store 
the groups $flu$ and $ilu$, because every set $I$ in these two groups has 
$\mathit{supp}(I) = LB_k(I) = UB_k(I)$, and thus, its support is derivable. Furthermore, 
it is not necessary to store the sets in $i\bar{l}u$, $cil\bar{u}$, and $ci\bar{l}\bar{u}$, 
because these sets have $UB_k(I) < s$ and thus are certainly infrequent. In Figure 1, 
the groups which cannot be excluded directly are indicated with boxes. The other 
groups can always be reconstructed, based on $FFree_k$. 

Notice that for all these groups, there is no need to store the supports of 
the sets in it. For example, for $fl\bar{u}_k$ all sets $I$ in $fl\bar{u}_k$ have 
$\mathit{supp}(I) = LB_k(I)$. Hence, we can derive the support of a set $I$ if we know 
that $I$ is in $fl\bar{u}_k$. Similar observations hold for the other groups as well. 
In the proposed representations, each group is stored separately. 

We can reduce the number of groups even more. For some subsets 
$\mathcal{G} = \{g_1, \ldots, g_n\}$ of the remaining groups 
$\{fl\bar{u}_k, cf\bar{l}u_k, uf\bar{l}u_k, uil\bar{u}_k, ui\bar{l}\bar{u}_k\}$, the structure 

$$(\{(I, \mathit{supp}(I)) \mid I \in FFree_k\}, g_1, \ldots, g_n)$$

will be a representation, and for some sets of groups $\mathcal{G}$ it will not. We 
denote the structure associated with $\mathcal{G}$ and rules up to depth $k$ with 
$S_k(\mathcal{G})$. 

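The structure $S_k(\{fl\bar{u}_k, uf\bar{l}u_k\})$ is a representation for every $k$, but 
neither $S_k(\{fl\bar{u}_k\})$ nor $S_k(\{uf\bar{l}u_k\})$ is. Hence, 
$S_k(\{fl\bar{u}_k, uf\bar{l}u_k\})$ is a minimal representation among the 
representations $S_k(\mathcal{G})$. The only minimal sets of groups $\mathcal{G}$ 
such that the associated structures are representations are: 

$\mathcal{G}_1 = \{fl\bar{u}, uf\bar{l}u\}$, $\mathcal{G}_2 = \{cf\bar{l}u, uf\bar{l}u\}$, 

$\mathcal{G}_3 = \{fl\bar{u}, uil\bar{u}, ui\bar{l}\bar{u}\}$, and $\mathcal{G}_4 = \{cf\bar{l}u, uil\bar{u}, ui\bar{l}\bar{u}\}$. 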



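Theorem 2. [7] Let $\mathcal{G} \subseteq \{fl\bar{u}, cf\bar{l}u, uf\bar{l}u, uil\bar{u}, ui\bar{l}\bar{u}\}$. 
$S_k(\mathcal{G})$ is a representation if and only if either $\mathcal{G}_1 \subseteq \mathcal{G}$, 
or $\mathcal{G}_2 \subseteq \mathcal{G}$, or $\mathcal{G}_3 \subseteq \mathcal{G}$, or 
$\mathcal{G}_4 \subseteq \mathcal{G}$. 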
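
The criterion of Theorem 2 amounts to a subset test against the four minimal sets of groups. The sketch below encodes it under the assumption that the group names are read as in the classification tree of Figure 1; the `~` before a letter stands for the overbar, and the names `MINIMAL_GROUP_SETS` and `is_representation` are hypothetical.

```python
# The four minimal sets of groups, written with '~' for the overbar (assumed reading).
MINIMAL_GROUP_SETS = [
    {"fl~u", "uf~lu"},               # G1
    {"cf~lu", "uf~lu"},              # G2
    {"fl~u", "uil~u", "ui~l~u"},     # G3
    {"cf~lu", "uil~u", "ui~l~u"},    # G4
]

def is_representation(groups):
    """True if the stored groups contain at least one of the minimal sets."""
    stored = set(groups)
    return any(minimal <= stored for minimal in MINIMAL_GROUP_SETS)
```

Storing additional groups never breaks the property, so any superset of a minimal set also passes the test.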
[Figure 2: diagram relating the set of all frequent itemsets with supports, NDIRep, 
ClosedRep, GDFreeGenRep, and the structures $S_k(\mathcal{G})$, ordered from less 
concise to more concise.] 

Fig. 2. Relation between the different representations. 



For the proof we refer to [7]. The theorem implies that representations 
$S_\infty(\mathcal{G}_1)$, $S_\infty(\mathcal{G}_2)$, $S_\infty(\mathcal{G}_3)$, and 
$S_\infty(\mathcal{G}_4)$ are minimal. Thus, all representations in Table 1 
have at least one $S_k(\mathcal{G})$ that is more concise. The relations between the 
different representations are given in Figure 2. For proofs of the relations see [7]. 

5 Experiments 

To empirically evaluate the newly proposed concise representations, we 
experimented with several database benchmarks used in [16]. Due to space limitations, 
we only report results for the BMS-Webview-1 dataset, containing 59 602 
transactions, created from click-stream data from a small dot-com company which 
no longer exists [17], and the pumsb* dataset, containing 100 000 transactions 
from census data from which items that occur more frequently than 80% are 
removed [3]. Each experiment finished within minutes (mostly seconds) on a 1GHz 
Pentium IV PC with 1GB of main memory. 

Figure 3 shows the total number of itemsets that is stored for each of the 
four new representations, together with the previously known minimal 
representations, i.e., the non-derivable itemsets, the closed itemsets, and the 
generalized disjunction-free generators. 

In both experiments, the representations $S_\infty(\mathcal{G}_1)$ and 
$S_\infty(\mathcal{G}_2)$ have more or less the same size. This is not very surprising, 
since the parts of the border these two representations store have a big overlap. Also 
the representations $S_\infty(\mathcal{G}_3)$ and $S_\infty(\mathcal{G}_4)$ are almost 
equal in size. Again we see that $\mathcal{G}_3$ and $\mathcal{G}_4$ are almost 
equal. 

Notice also that for BMS-Webview-1 the representations GDFreeGenRep and 
$S_\infty(\mathcal{G}_3)$ have the same size. The reason for this can be found in 
Figure 2. In this figure we see that the size of GDFreeGenRep is between the sizes of 
$S_2(\mathcal{G}_3)$ and $S_\infty(\mathcal{G}_3)$. Therefore, the fewer rules of depth 
more than 2 that need to be evaluated in order to get optimal bounds, the closer 
GDFreeGenRep will be to $S_\infty(\mathcal{G}_3)$. In Figure 4, the effect of varying 
rule depth is given. The plot shows the sizes of the representations 
$S_k(\mathcal{G}_1)$ for different values of $k$. For the BMS-Webview-1 dataset, 
evaluating rules of depth greater than 2 does not give any additional gain. In 
the pumsb* dataset, some gain is still achieved with rules of depth 3. Hence, in 
the BMS-Webview-1 dataset, GDFreeGenRep and $S_\infty(\mathcal{G}_3)$ have similar size, 
and in the pumsb* dataset, there is a slight difference in the part of the border that 
is stored. In the BMS-Webview-1 dataset, the total number of sets in representations 
$S_\infty(\mathcal{G}_1)$ and $S_\infty(\mathcal{G}_2)$ is smaller than in all other 
representations, except for the closed sets. However, in the pumsb* dataset, the closed 
set representation is much larger than all others. As can be seen, 
$S_\infty(\mathcal{G}_3)$ and $S_\infty(\mathcal{G}_4)$ sometimes contain more sets, 
which was expected since these representations also include infrequent sets. 

Fig. 3. Number of sets in concise representations for varying minimal support. 

Fig. 4. Number of sets in concise representations of BMS-Webview-1 for varying rule 
depth. 
Additionally, to get these results, only rules up to depth 3 were needed to 
be evaluated. This is illustrated in Figure 4, in which we plotted the size of the 
condensed representation for varying rule depth. 

Also, for all other experiments almost no additional gain resulted from eval- 
uating rules of depth larger than 3. As a consequence, the additional effort to 
evaluate only these rules is almost negligible during the candidate generation 
of the frequent set mining algorithm. Indeed, for every itemset I, at most ) 
rules need to be evaluated, each containing at most three terms. 
Abstract. One basic goal in the analysis of time-series data is to find 
frequent interesting episodes, i.e., collections of events occurring 
frequently together in the input sequence. Most widely known works 
decide the interestingness of an episode from a fixed user-specified window 
width or interval that bounds the length of the subsequent sequential 
association rules. We present in this paper a more intuitive definition that 
allows, in turn, interesting episodes to grow during the mining without 
any user-specified help. A convenient algorithm to efficiently discover the 
proposed unbounded episodes is also implemented. Experimental results 
confirm that our approach is useful and advantageous. 



1 Introduction 

A well-defined problem in Knowledge Discovery in Databases arises from the 
analysis of sequences of data, where the main goal is the identification of 
frequently arising patterns or subsequences of events. There are at least two related 
but somewhat different models of sequential pattern mining. In one of them 
each piece of data is a sequence (such as the amino acids of a protein, the banking 
operations of a client, or the occurrences of recurrent illnesses), and one desires to 
find patterns common to several pieces of data (proteins with similar biological 
functions, clients of a similar profile, or plausible consequences of medical 
decisions). See [2] or [7] for an introduction to this model of a sequential database. 
The second model of sequential pattern matching is the slightly different ap- 
proach proposed in [6], where data come in a single, extremely long stream, e.g. 
a sequence of alarms in a telecommunication network, in which some recurring 
patterns, called episodes, are to be found. 

Both problems seem similar enough, but we concentrate here on the second 
one of finding episodes in a single sequence. Abstractly, such ordered data can 
be viewed as a string of events, where each event has an associated time of 
occurrence. An example of an event sequence is represented in Figure 1. Here A, 
B and C are the event types, such as the different types of user actions marked 
on a time line. 

* This work is supported in part by EU ESPRIT IST-1999-14186 (ALCOM-FT), and 
MCYT TIC 2002-04019-C03-01 (MOISES) 
[Figure 1: events of types A, B, and C marked on a time line.] 

Fig. 1. A sequence of events 

We briefly describe the current approaches to interesting episodes, point out 
some disadvantages, and then propose, as our main contribution, an alternative 
approach for defining a new kind of serial episodes, i.e., unbounded episodes. We 
then explain how previous algorithms for finding frequent sets can be applied 
to our approach, and suggest an interpretation of parallel episodes as summaries 
of serial episodes, with the corresponding algorithmic consequences. Finally, we 
describe the results of a number of preliminary experiments with our proposals. 

2 Framework Formalization 

To formalize the framework of the time-series data we follow the terminology, 
notation, and setting of [6]. The input of the problem is a sequence of events. 
Given a set $E$ of event types, an event is a pair $(A, t)$ where $A \in E$ is an event 
type and $t$ is its occurrence time. 

An event sequence is a triple $(s, T_s, T_e)$, where $T_s$ is called the starting time 
of the sequence, $T_e$ is the ending time, and $s$ has the form 
$s = \langle (A_1, t_1), \ldots, (A_n, t_n) \rangle$, where $A_i$ is an event type, and 
$t_i$ is the associated occurrence time, with $T_s < t_i < t_{i+1} < T_e$ for all 
$i = 1, \ldots, n - 1$. The time $t_i$ can be measured in any time 
unit, since this is actually irrelevant for our algorithms and proposals. 
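
One possible way to hold such an event sequence in code is sketched below. The class name and field names are hypothetical, integer time stamps are an assumption, and the validation uses the strict inequalities as printed above.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class EventSequence:
    """An event sequence (s, T_s, T_e) with events given as (event type, occurrence time)."""
    events: Tuple[Tuple[str, int], ...]
    t_start: int
    t_end: int

    def __post_init__(self):
        times = [t for _, t in self.events]
        # Occurrence times are assumed to be strictly increasing.
        if any(a >= b for a, b in zip(times, times[1:])):
            raise ValueError("occurrence times must be strictly increasing")
        if times and not (self.t_start < times[0] and times[-1] < self.t_end):
            raise ValueError("events must lie between t_start and t_end")

# Hypothetical example with three events.
demo_sequence = EventSequence((("A", 20), ("C", 24), ("B", 98)), 0, 100)
```

Keeping the events in an immutable tuple makes it safe to share one sequence between several mining passes.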

2.1 Episodes 

Our desired output for each input sequence is a set of frequent episodes. An 
episode is a partially ordered collection of events occurring together in the given 
sequence. Episodes can be described as directed acyclic graphs. Consider, for 
instance, episodes $\alpha$, $\beta$ and $\gamma$ in Figure 2. Episode 
$\alpha = B \rightarrow C$ is a serial episode: event type $B$ occurs before event type 
$C$ in the sequence. Of course, there can be other events occurring between these two 
in the sequence. Episode $\beta = \{A, B\}$ is a parallel episode: events $A$ and $B$ 
occur frequently close in the sequence, but there are no constraints about the order 
of their appearances. Finally, episode $\gamma$ is an example of a hybrid episode: it 
occurs in a sequence if there are occurrences of $A$ and $B$ and these precede an 
occurrence of $C$, possibly, again, with other intervening events. 

More formally, an episode can be defined as a triple $(V, \le, g)$ where $V$ is a
set of nodes, $\le$ is a partial order relation on $V$, and $g : V \to E$ is a mapping
associating each node with an event type. We also define the size of an episode
as the number of events it contains, i.e., $|V|$. The interpretation of an episode is
that events in $g(V)$ must occur in the order described by $\le$. In this paper we
will only deal with serial and parallel episodes.

Definition 1. An episode $\beta = (V', \le', g')$ is a subepisode of $\alpha = (V, \le, g)$,
denoted by $\beta \subseteq \alpha$, if there exists an injective mapping $f : V' \to V$ such that
$g'(v) = g(f(v))$ for all $v \in V'$, and for all $v, w \in V'$ with $v \le' w$ also $f(v) \le f(w)$.
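
For the two episode classes considered in the paper, the subepisode relation reduces to simple checks. The sketch below assumes a serial episode is encoded as a tuple of event types in order, and a parallel episode as a tuple of event types in any order; the function names are hypothetical.

```python
from collections import Counter


def is_serial_subepisode(beta, alpha):
    """True if `beta` is an order-preserving subsequence of `alpha`."""
    remaining = iter(alpha)
    # Each event type of beta must be found after the previously matched one.
    return all(any(b == a for a in remaining) for b in beta)


def is_parallel_subepisode(beta, alpha):
    """True if the multiset of event types of `beta` is contained in that of `alpha`."""
    need, have = Counter(beta), Counter(alpha)
    return all(have[e] >= n for e, n in need.items())


print(is_serial_subepisode(("A", "C"), ("A", "B", "C")))
print(is_parallel_subepisode(("A", "A"), ("A", "B")))
```

The shared iterator in the serial check plays the role of the injective, order-preserving mapping $f$: every node of the smaller episode is matched to a distinct, later node of the larger one.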



3 Classical Approaches to Define Interesting Episodes 

In the analysis of sequences we are interested in finding all frequent episodes
from a class of episodes that can be interesting to the user. In this section
we mainly take the classical and widely used work of [6] as a reference, i.e., we
require that, to be considered interesting, the events of an episode must occur close
enough in time.



3.1 Winepi 



In the first approach of [6], the user defines how close the events of an interesting
episode should be by giving the width of the time window within which the
episode must occur. The number of possible windows of a certain width $\mathit{win}$ in
the sequence $(s, T_s, T_e)$ is exactly $T_e - T_s + \mathit{win} - 1$, and we denote by $W(s, \mathit{win})$
the set of all these windows of size $\mathit{win}$. The frequency of an episode $\alpha$
in $s$ is then defined to be:

$$fr(\alpha, s, \mathit{win}) = \frac{|\{w \in W(s, \mathit{win}) \mid \alpha \text{ occurs in } w\}|}{|W(s, \mathit{win})|}$$



So, an episode is frequent according to the number of windows in which it
occurs or, more precisely, according to the ratio of that number to the total number
of possible windows in the sequence. To be frequent, the ratio $fr(\alpha, s, \mathit{win})$ of an
episode must be over a minimum user-specified real value. The Winepi approach applies
the Apriori algorithm to find the frequency of all the candidate episodes in the sequence,
as though each sliding window were a transaction with ordered events.
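
The window-based frequency can be sketched directly from its definition. The code below is a brute-force illustration, not the Apriori-style algorithm of [6]. It assumes integer time stamps and half-open windows $[w_s, w_s + \mathit{win})$ that overlap the sequence, which gives exactly $T_e - T_s + \mathit{win} - 1$ windows; all function names are hypothetical.

```python
from collections import Counter


def occurs_serial(episode, window_events):
    """Greedy check: event types appear in order at strictly increasing times."""
    pos, last_t = 0, None
    for etype, t in window_events:
        if pos < len(episode) and etype == episode[pos] and (last_t is None or t > last_t):
            pos, last_t = pos + 1, t
    return pos == len(episode)


def occurs_parallel(episode, window_events):
    """Check that every event type of the episode is present, in any order."""
    have = Counter(etype for etype, _ in window_events)
    return all(have[e] >= n for e, n in Counter(episode).items())


def winepi_frequency(events, t_start, t_end, win, episode, serial=True):
    """Fraction of windows of width `win` in which `episode` occurs."""
    occurs = occurs_serial if serial else occurs_parallel
    starts = range(t_start - win + 1, t_end)  # T_e - T_s + win - 1 window starts
    hits = 0
    for ws in starts:
        window = [(e, t) for e, t in events if ws <= t < ws + win]
        hits += occurs(episode, window)
    return hits / len(starts)


events = [("A", 1), ("B", 2), ("C", 3)]
print(winepi_frequency(events, 1, 4, 2, ("A", "B")))
```

With a window of width 2 the serial episode A followed by C never fits, which is the kind of parameter sensitivity the later sections discuss.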

Once the frequent interesting episodes are discovered from the sequence, the
second goal of the approach is to create the episode association rules that
hold with at least a certain minimum confidence. For all episodes $\beta \subseteq \alpha$, an episodal
rule $\beta \Rightarrow \alpha$ holds with confidence:

$$\mathit{conf}(\beta \Rightarrow \alpha) = \frac{fr(\alpha, s, \mathit{win})}{fr(\beta, s, \mathit{win})}$$

3.2 Minepi

Minepi is based on minimal occurrences of episodes in a sequence. For each
frequent episode, the algorithm finds the location of its minimal occurrences.
Given an episode $\alpha$ and an event sequence $s$, we say that the interval $w = [t_s, t_e)$
is a minimal occurrence of $\alpha$ in $s$ if:

(1) $\alpha$ occurs in the window $w$.

(2) $\alpha$ does not occur in any proper subwindow of $w$.

Basically, the applied algorithm is Apriori: it locates, for every episode, going
from the smaller ones to the larger ones, its minimal occurrences. In the candidate
generation phase, the location of the minimal occurrences of a candidate $\alpha$ is
computed as a temporal join of the minimal occurrences of two subepisodes of $\alpha$.
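
Conditions (1) and (2) can be checked by brute force for a serial episode. The following sketch is not the temporal-join procedure used by Minepi; it assumes integer time stamps, half-open intervals, and a hypothetical function name.

```python
def minimal_occurrences(events, episode):
    """Minimal occurrences [ts, te) of a serial episode (tuple of event types).

    `events` is a list of (event type, integer time) pairs.
    """
    events = sorted(events, key=lambda ev: ev[1])
    candidates = set()
    for i, (etype, ts) in enumerate(events):
        if etype != episode[0]:
            continue
        # Greedily complete the episode as early as possible from this start.
        pos, last_t = 1, ts
        for etype2, t2 in events[i + 1:]:
            if pos < len(episode) and etype2 == episode[pos] and t2 > last_t:
                pos, last_t = pos + 1, t2
        if pos == len(episode):
            candidates.add((ts, last_t + 1))  # half-open interval, integer times
    # Keep only the intervals that contain no other occurrence interval.
    return sorted(
        c for c in candidates
        if not any(o != c and c[0] <= o[0] and o[1] <= c[1] for o in candidates)
    )


events = [("A", 1), ("A", 2), ("B", 3), ("A", 5), ("B", 8)]
print(minimal_occurrences(events, ("A", "B")))
```

In this toy sequence the interval starting at time 1 is discarded because the shorter interval starting at time 2 already contains an occurrence.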

This approach differs from Winepi in that it does not use a frequency
ratio to decide when an episode is frequent. Instead, an episode is considered
frequent when its number of minimal occurrences is over an integer value given
by the user. This is a consequence of the fact that the lengths of the minimal
occurrences vary, so that a uniform ratio could be misleading. One advantage of
this approach is that it allows the user to find final rules with two window widths,
one for the left-hand side and one for the whole rule, such as “if A and B occur
within 15 seconds, then C follows within 30 seconds”. So, in this approach an
episode association rule is an expression $\beta[\mathit{win}_1] \Rightarrow \alpha[\mathit{win}_2]$, where $\beta$ and
$\alpha$ are episodes such that $\beta \subseteq \alpha$, and $\mathit{win}_1$ and $\mathit{win}_2$ are integers specifying
interval widths. The informal interpretation of the rule is that if episode $\beta$ has a
minimal occurrence at interval $[t_s, t_e)$ with $t_e - t_s \le \mathit{win}_1$, then episode $\alpha$ occurs
at interval $[t_s, t'_e)$ for some $t'_e$ such that $t'_e - t_s \le \mathit{win}_2$.

The confidence of an episode association rule $\beta[\mathit{win}_1] \Rightarrow \alpha[\mathit{win}_2]$ with
$\beta \subseteq \alpha$ and two user-specified interval widths $\mathit{win}_1$ and $\mathit{win}_2$ is the following:

$$\mathit{conf}(\beta[\mathit{win}_1] \Rightarrow \alpha[\mathit{win}_2]) = \frac{|\{[t_s, t_e) \in mo(\beta) \ \text{s.t.}\ t_e - t_s \le \mathit{win}_1 \text{ and } [t_s, t_s + \mathit{win}_2) \in mo(\alpha)\}|}{|\{[t_s, t_e) \in mo(\beta) \ \text{s.t.}\ t_e - t_s \le \mathit{win}_1\}|}$$

where $mo(\alpha)$ is the set of minimal occurrences of the episode $\alpha$ in the original
input sequence. So, even if there is no fixed window size (as in the Winepi
approach) and minimal occurrences are apparently not restricted in length, the user
still needs to specify the time bounds $\mathit{win}_1$ and $\mathit{win}_2$ for the generation of
the subsequent episode rules and their confidences. These values force minimal
occurrences to be bounded by a fixed interval size of at most $\mathit{win}_2$ time units
during the mining process.

3.3 Some Disadvantages of These Previous Approaches 

We summarize below some of the observed disadvantages of Winepi and Minepi.

— In Winepi the window width is fixed by the user and it remains fixed throughout
the mining. Consequently, the size of the discovered episodes is limited.
Winepi just reduces the problem of mining the long event sequence to mining a
sequential database (such as in [2]), where each transaction is now a fixed
window.

— In Minepi the user specifies two time bounds for the creation of the
subsequent episode association rules. These intervals cause the final minimal
occurrences to be bounded in size, since only those occurrences contained
within the bounds are counted.

— Both Minepi and Winepi require the end user to fix one parameter with
little guidance on how to do it. Intervals or windows that are too wide can lead
to misleading episodes whose events are widely separated from each other,
so that the subsequent rules turn out to be uninformative. On the other hand,
intervals or windows set too tight give rise to overlapping episodes: if there
exists an interesting episode $\alpha$ whose size is larger than the fixed window
width, then that episode will never fit in any window and, consequently, $\alpha$
will be discovered only partially.

— Minepi does not use a frequency ratio to decide whether an episode is
frequent. This makes it difficult to apply sampling in the algorithms for
finding frequent episodes.

— If the user decides to find the episode association rules for a different
time bound (a different window size in Winepi or a different interval length
in Minepi), then the algorithm that finds the frequent episodes
has to be run again, incurring an inconvenient overhead.

— Neither approach seems truly suitable for problems where the
adjacency of the events in the discovered episodes is a must (such as protein
function identification). Neither Winepi nor Minepi allows this kind of
restriction to be set between the events of an interesting episode.

4 Unbounded Episodes 

In order to avoid all these drawbacks and be able to enlarge the window width
automatically throughout the mining process, we propose the following approach.

We will consider a serial or parallel episode interesting if it fulfills the following
two properties:

(1) Its correlative events have a gap of at most tus time units (see Figure 3).

(2) It is frequent.



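[Figure 3: a serial episode $\alpha = A \to B \to C$ and a parallel episode $\beta = \{A, B, C\}$, with each gap between correlative events labeled tus.]

Fig. 3. Example of serial and parallel unbounded episodes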



So, in our proposal, the measure of interestingness is based on tus, the time-unit
separation between correlative events in the episode. This number of
time units must be specified by the user. The above two episodes $\alpha$ and $\beta$ are
examples of the interpretation of our approach. In the serial episode $\alpha$, the
distance between A and B is at most tus, and the distance between B and C is also
at most tus time units. Besides, although the distance between events A and C is not
specified, it can be clearly seen that they are at most $2 \times \mathit{tus}$ time units apart. In the
parallel episode $\beta$, the distance between correlative events A, B and C, regardless of the
order of their appearance in the sequence, must be at most tus time units.
More generally, an episode of size $e$ may span up to $(e - 1) \times \mathit{tus}$ time units.
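
The gap condition and the resulting span bound can be illustrated with a small exhaustive search over a toy sequence. This is a sketch under assumptions: serial episodes only, integer time stamps, strictly increasing times between consecutive events, and a hypothetical function name.

```python
def gap_occurrences(events, episode, tus):
    """Time stamps of every occurrence of a serial episode whose consecutive
    events are at most `tus` time units apart."""
    found = []

    def extend(times):
        if len(times) == len(episode):
            found.append(tuple(times))
            return
        for etype, t in events:
            if etype != episode[len(times)]:
                continue
            # The first event is unconstrained; later ones must respect the gap.
            if not times or times[-1] < t <= times[-1] + tus:
                extend(times + [t])

    extend([])
    return found


events = [("A", 1), ("B", 3), ("C", 4), ("A", 6), ("C", 9)]
print(gap_occurrences(events, ("A", "B", "C"), tus=2))
print(gap_occurrences(events, ("A", "C"), tus=2))
```

With `tus=2` the episode of size 3 occurs with a span of 3 time units, within the bound $(3 - 1) \times 2 = 4$, while the shorter episode A followed by C has no occurrence because its single gap may not exceed 2.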

Now, every episode that is a candidate to be frequent will be searched in
windows whose width is determined by the episode size: an episode with $e$
events will be searched in all windows in the sequence of width $(e - 1) \times \mathit{tus}$ time
units. Thus, the window width is not bounded, nor is the size of the episode,
and both will grow automatically, if necessary, during the mining. This explains
the name chosen: we are mining unbounded episodes.

At this point, it is worth mentioning the work of [8] (which contributes the
algorithm cSPADE) and [7] (the algorithm GSP). These two papers integrate
into the mining process the possibility of defining a max-gap constraint between
the elements of the frequent sequences found in a sequential database. However,
this max-gap constraint in [8] or [7] does not lead to an unbounded class of
patterns as we present here. The reason is that they work on the sequential
database problem, and so the frequent mined patterns turn out to be naturally
bounded by the length of the transactions in the database.

With our approach the window width is allowed to grow automatically
without any predetermined limits. The frequency of an episode can be defined in
the following way: let us denote by $W_k(s, \mathit{win})$ the total set of windows in a
sequence $(s, T_s, T_e)$ of a fixed width $\mathit{win} = k \times \mathit{tus}$ time units (the number of
such windows in the sequence is $T_e - T_s + \mathit{win} - 1$). Then:

Definition 2. The frequency of an episode $\alpha$ of size $k + 1$ in a sequence $(s, T_s, T_e)$
is:

$$fr(\alpha, s, \mathit{tus}) = \frac{|\{w \in W_k(s, \mathit{win}) \mid \alpha \text{ occurs in } w\}|}{|W_k(s, \mathit{win})|}$$

where $\mathit{win} = (|\alpha| - 1) \times \mathit{tus} = k \times \mathit{tus}$.



Note that the dependence on $\mathit{win}$, for fixed $\alpha$, is here simply a more natural
way to reflect the dependence on the user-supplied parameter tus; both
correspond to the same fact, since $\mathit{win}$ is a fixed multiple of tus.

To sum up, an episode $\alpha$ will be frequent if its frequency is over a minimum
user-specified frequency, that is, according to the number of windows in which
it occurs; however, the width of those windows depends on the number of events
in $\alpha$. So, the effect in the algorithm is that, as an episode becomes bigger
and the number of its events increases, the window in which that episode
is searched also increases its width; simultaneously, the ratio that has to be
compared with the user-specified desired frequency is adjusted accordingly.

4.1 Episode Association Rule with Unbounded Episodes



The approach of mining unbounded episodes is flexible enough to allow the
generation of association rules according to two interval widths (one for the
left-hand side and one for the whole rule, as with Minepi).

An unbounded episode rule is an expression $\beta[n_l] \Rightarrow \alpha[n_r]$, where $\beta$
and $\alpha$ are unbounded episodes such that $\beta \subseteq \alpha$, and $n_l$ and $n_r$ are integers such
that $n_l = |\beta|$ and $n_r = |\alpha| - |\beta|$. Informally, $n_l$ is the number of events
occurring in the left-hand side of the rule, and $n_r$ is the number of new events
implied by the right-hand side.

So, we can rewrite any unbounded episode rule $\beta[n_l] \Rightarrow \alpha[n_r]$ in terms of a
rule with two window widths $\beta[w_1] \Rightarrow \alpha[w_2]$ by considering $w_1 = (n_l - 1) \times \mathit{tus}$
and $w_2 = n_r \times \mathit{tus}$. This transformation leads to an easy and informative
interpretation of the rule: “if the events in $\beta$ occur within $w_1$ time units, then the
rest of the events in $\alpha$ will follow within $w_2$ time units”.

One of the advantages of the proposed approach is that focusing our episode
search on the time-unit separation between events allows us to generate the
best unbounded episode rule $\beta[n_l] \Rightarrow \alpha[n_r]$ (and so the best rule $\beta[w_1] \Rightarrow \alpha[w_2]$)
without fixing any extra parameter: neither $n_l$ nor $n_r$ is user-specified
for any rule, since these values are taken from the antecedent and
consequent that maximize the confidence of that rule (in other words,
$n_l$ and $n_r$ are uniquely determined by the size of the antecedent episode
and the size of the consequent episode in the best rule
according to the confidence ratio).

Since in our approach we have a frequency ratio as support, we can define
the confidence of a rule $\beta \Rightarrow \alpha$ for $\beta \subseteq \alpha$ as:

$$\mathit{conf}(\beta \Rightarrow \alpha) = \frac{fr(\alpha, s, \mathit{tus})}{fr(\beta, s, \mathit{tus})}$$

where the value of $fr(\alpha, s, \mathit{tus})$, for a fixed $\alpha$, depends on the occurrences of $\alpha$
in all windows of length $(|\alpha| - 1) \times \mathit{tus}$ in the sequence. Note that since $\beta$ is
a subepisode of $\alpha$, the rule right-hand side $\alpha$ contains information about the
relative location of each event in it, so the “new” events in the rule right-hand side
can actually be required to be positioned between events in the left-hand side.
The rules defined here are also rules that point forward in time (rules that point
backwards can be defined in a similar way).

As we see, the values $n_l$ and $n_r$ of a rule do not affect the confidence, and they
can be determined after having chosen the best rule by the following procedure:

for each maximal episode $\alpha$,

$$\beta[|\beta|] \Rightarrow \alpha[|\alpha| - |\beta|] = \arg\max \{\mathit{conf}(\beta \Rightarrow \alpha) \ \text{s.t.}\ \beta \subseteq \alpha\}$$
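
The selection step above can be sketched as follows. The sketch assumes that frequencies have already been mined into a dictionary and that the caller supplies the proper subepisodes to consider; the function name, the dictionary layout and the toy frequencies are hypothetical. The window widths use $w_1 = (n_l - 1) \times \mathit{tus}$ and $w_2 = n_r \times \mathit{tus}$, as in the rewriting of unbounded rules given earlier.

```python
def best_rule(alpha, antecedents, freq, tus):
    """Choose the antecedent of `alpha` with the highest confidence.

    `freq` maps episodes (tuples of event types) to their frequencies;
    `antecedents` lists the proper subepisodes of `alpha` to consider.
    """
    def conf(beta):
        return freq[alpha] / freq[beta]

    beta = max(antecedents, key=conf)
    n_l = len(beta)               # events in the left-hand side
    n_r = len(alpha) - len(beta)  # new events implied by the right-hand side
    return {
        "antecedent": beta,
        "confidence": conf(beta),
        "n_l": n_l,
        "n_r": n_r,
        "w1": (n_l - 1) * tus,
        "w2": n_r * tus,
    }


# Toy frequencies, made up for illustration only.
freq = {("A", "B", "C"): 0.2, ("A", "B"): 0.2, ("A",): 0.5}
rule = best_rule(("A", "B", "C"), [("A",), ("A", "B")], freq, tus=3)
print(rule)
```

Neither $n_l$ nor $n_r$ is an input: both fall out of the antecedent that wins the comparison.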



So, the final window widths ($w_1 = (|\beta| - 1) \times \mathit{tus}$ and $w_2 = (|\alpha| - |\beta|) \times \mathit{tus}$) are
determined by the best rule in terms of confidence, and they can vary from one
rule to another, always adapting to the best combination. Note that instead of
confidence, any other well-defined metric for episodes could be used to select the
best rule in this procedure, and so different unbounded rules would be obtained.

[Figure 4: an event sequence consisting of the event groups A B C, B A C and B A C.]

Fig. 4. Example of an event sequence

The example in Figure 4 illustrates the advantages of our unbounded
episode approach. The sequence in this figure shows that we could consider
the episodes $\beta = \{A, B\}$ and $\gamma = \{A, B\} \to C$ frequent (as they are represented
as graphs in Figure 2). The best association rule we can find in this example is
$\{A, B\} \Rightarrow C$, which should have a confidence of 1 for the presented
piece of sequence.

For Winepi, a fixed window at least 5 time units wide has to be
specified to find both $\beta$ and $\gamma$. But this parameter depends on the user, and
choosing the right value is not intuitive. In this example, if the user
chooses a window width of 3 time units, then the episode $\gamma$ will never be fully
discovered and the rule will never be generated.

With Minepi, the problem comes when specifying the two window widths
for the episode association rule. If the user chooses $\mathit{win}_1 = 3$ and $\mathit{win}_2 = 4$,
the generated rule is $\beta[3] \Rightarrow \gamma[4]$, which has a confidence of just 1/3 in
this example. It is not the best association rule, because the value
$\mathit{win}_2 = 4$ is set too tight. Besides, if the user wants to specify a wider
$\mathit{win}_2$, the algorithm finding frequent minimal occurrences has to be run again.

For the unbounded approach, however, one would find both $\beta$ and $\gamma$ by just
specifying a big enough value for tus. This is an intuitive parameter, and the
best subsequent episode rule in terms of confidence would be $\beta[2] \Rightarrow \gamma[1]$, with
a confidence of 1. This rule can be transformed in terms of two window widths
and interpreted as follows: “if A and B occur within tus time units, then C will
follow in the next tus time units”.



4.2 Advantages of Our Approach 

We briefly summarize some advantages of our unbounded proposal.

— Since the window increases its width along with the episode size, the final
frequent episodes do not overlap unnecessarily, and their size is not limited.

— The unbounded episode rules have better quality in terms of confidence
without any previous user help.

— Unbounded episodes generalize Minepi and Winepi in that episodes found
with a window width of $x$ time units can be found with our approach using
a distance of $x - 1$ time units between correlative events.

— Sampling techniques can be applied.

— Once the frequent unbounded episodes are mined, finding the episode rules
with two window widths is straightforward. What is more, the user can try different
window widths for the rules, and choose the best width according to some
statistical metric. This does not affect the previous mining or the discovered
unbounded episodes, which remain the same once we are in the
rule generation phase.

— Our proposal can be adapted to the sequential-database style by imposing
a wide gap between the different pieces of data.

On the whole, we can say that unbounded episodes are more general and
intuitive than the Minepi or Winepi approaches. In particular, these unbounded
episodes can be very useful in contexts such as the classification of documents
or intrusion detection systems. As argued in [5], a drawback of subsequence
patterns is that they are not suitable for classifying long strings over a small
alphabet, since a short subsequence pattern matches almost all long strings.
So, the larger the episodes found in a text, the better for future predictions.



5 Algorithms to Mine Unbounded Episodes 

Our proposed definition of unbounded episodes is flexible enough to still allow
the use of previous algorithms. Besides, to show the flexibility of the proposal,
we also adapt here our strategy from [3], the Best-First strategy, which is a
non-trivial evolution of Dynamic Itemset Counting (DIC, [4]) and provides better
performance than both Apriori and DIC. For better understanding, we give a
brief account of how our Best-First strategy works.

Similarly to DIC, our strategy keeps cycling through the data as many times
as necessary, counting the support of a number of candidate itemsets. Whenever
one of them reaches the threshold that declares it frequent, it immediately
“notifies” this fact to all itemsets one unit larger than it. In this way, potential future
candidates keep being informed of whether each of their immediate predecessors
is frequent. When all of them are, the potential candidate is promoted to
candidate and its support starts to be counted. DIC follows a similar pattern but
only tries to generate new candidates every M processed transactions: running it
with M = 1 would be similar to the Best-First strategy, but would incur overheads
that our algorithm avoids thanks to the online information it keeps about which
subsets of the potential candidates are frequent at each moment.
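
The notification mechanism can be pictured with a counter per potential candidate. This is only a sketch of the bookkeeping described above, not the implementation of [3]: support counting and the scanning of the data are left out, each itemset is assumed to be notified at most once, and the class and method names are hypothetical.

```python
from collections import defaultdict


class PotentialCandidates:
    """Tracks how many immediate subsets of each potential candidate are frequent."""

    def __init__(self, items):
        self.items = frozenset(items)
        self.notified = defaultdict(int)

    def notify_frequent(self, itemset):
        """Record that `itemset` became frequent; return the supersets promoted now."""
        promoted = []
        for extra in self.items - itemset:
            superset = itemset | {extra}
            self.notified[superset] += 1
            # A set of size k has exactly k immediate subsets of size k - 1.
            if self.notified[superset] == len(superset):
                promoted.append(superset)
        return promoted


pc = PotentialCandidates("ABC")
first = pc.notify_frequent(frozenset("A"))
second = pc.notify_frequent(frozenset("B"))
print(first, second)
```

After both single items are declared frequent, the pair containing them has received all of its notifications and is promoted to candidate.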

To follow the same structure, our new algorithm for mining episodes, called
Episodal Best-First (EpiBF), will distinguish two sets of episodes: (1) candidate
episodes, whose frequency is being counted, and (2) potential candidates, which
will be incorporated as candidates as soon as the monotonicity property of
frequency is fulfilled. However, given that we are now using our unbounded
notion of interestingness, we must relax the monotonicity property of frequency
for pruning unwanted candidates. Other breadth-first algorithms, like Apriori
or DIC, can also be easily applied by taking into account that at each new scan
of the database for candidates of size $k$, the window width must be incremented
accordingly (i.e., to $(k - 1) \times \mathit{tus}$). Apart from that, we also have to relax here the
monotonicity property of frequency used in the candidate generation phase, as
we will see shortly. We discuss the case of serial episodes first.

5.1 Discovering Serial Episodes

With our approach of unbounded episodes, the well-known monotonicity
property of frequency (stating that any frequent episode has all its
subepisodes also frequent) does not hold: that is, a frequent unbounded episode
could have some subepisode that is not frequent. For instance, let us consider the
unbounded serial episode $A \to B \to C$ and its subepisode $A \to C$. They refer to
two different classes of unbounded interestingness: while in $A \to B \to C$ events
A and C are separated by at most $2 \times \mathit{tus}$ time units, in its subepisode $A \to C$
the events are separated by at most tus time units. So, since the gap between
events is different in both episodes, it might well be that $A \to C$ is not frequent
while $A \to B \to C$ is frequent. We cannot use this property to prune unwanted
candidates.

But we will relax this notion here and just consider those subepisodes
whose events keep an adjacency of tus time-unit separation. For instance, to
consider the episode $A \to B \to C$ a good candidate that deserves to be counted
in the data, only the subepisodes $A \to B$ and $B \to C$ have to be found frequent
(i.e., the overlapping parts of an unbounded episode). Then, it is true that any
frequent unbounded episode has all its overlapping parts frequent.

Now, the algorithm EpiBF for serial episodes works in the following way. It
starts by initializing the set of candidate episodes with all episodes of size 2,
and the set of potential candidates with all episodes of size 3. Then, it goes on
counting the frequency of all the candidate episodes until this set becomes empty.
When one of these candidate episodes of size $k$ becomes frequent, it
increments the counters corresponding to all the potential candidates of size $k + 1$
that can be obtained by adding one more event before it or after it. This growth
leads to unbounded episodes. On the other hand, when a potential candidate
of size $k + 1$ finds that both subepisodes of size $k$, obtained by chopping off
either end, have been declared frequent, it is incorporated into the set
of candidate episodes.
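
The promotion test for serial episodes can be sketched in a few lines. The code below only illustrates the relaxed monotonicity check; frequency counting and window handling are omitted, serial episodes are assumed to be tuples of event types, and the function names are hypothetical.

```python
def extensions(episode, alphabet):
    """Size k+1 episodes obtained by adding one event type before or after `episode`."""
    before = {(e,) + episode for e in alphabet}
    after = {episode + (e,) for e in alphabet}
    return before | after


def ready_candidates(potential, frequent):
    """Potential candidates whose two overlapping parts (the episode without its
    last event and without its first event) are both frequent."""
    return [ep for ep in potential if ep[:-1] in frequent and ep[1:] in frequent]


frequent = {("A", "B"), ("B", "C")}
potential = sorted(extensions(("A", "B"), "ABC"))
print(ready_candidates(potential, frequent))
```

Note that the subepisode A followed by C is never consulted: only the two overlapping parts obtained by chopping off either end decide the promotion.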

It is important to highlight that, in this algorithm, the set of candidate
episodes can be made up of episodes of different sizes, and each episode $\alpha$ of size
$k$ must be searched and counted in all windows of width $(k - 1) \times \mathit{tus}$ time units.
This forces EpiBF to handle windows of different sizes at the same time, by
simply taking, at every step, the largest window for the longest episode in the
set of candidate episodes. The rest of the episodes in the set of candidate episodes
are searched in the proper subwindows.

5.2 Discovering Serial and Parallel Episodes Simultaneously 

The problem of mining parallel episodes can be reduced efficiently to
mining serial episodes in the following way. Every parallel episode of size $k$
lumps together up to $k!$ serial episodes. For instance, the parallel episode $\{A, B\}$
gathers the following two serial episodes: $A \to B$ and $B \to A$. In this case, a
serial episode will be called a participant of a parallel episode. Clearly, any serial
episode is a participant of one, and only one, parallel episode.

Let us discuss what the meaning of parallel episode mining could be. Clearly,
if a frequent parallel episode has some (but not all) participants already frequent,
the desired output is the list of such frequent serial episodes: the parallel one,
given alone, provides less information. In such cases we should not move from
the serial episodes to the parallel one unless all of them are actually frequent:
in this last case, the parallel episode is an effective way of representing this fact.
Thus, according to our proposal, in order to be considered interesting, a parallel
episode $\alpha$ must fulfill one of the two following conditions: either

1. by adding up the frequencies of the serial episodes that are participants of $\alpha$,
we reach the user-specified minimal frequency, but no serial episode
participant of $\alpha$ is frequent alone; or

2. every serial episode participant of $\alpha$ is frequent.
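
The notion of participant and the two conditions above can be sketched as follows. The sketch assumes that the frequencies of the serial participants are already available in a dictionary, and that "frequent" means a frequency of at least the threshold (whether the comparison is strict is an assumption); the function names are hypothetical.

```python
from itertools import permutations


def participants(parallel_episode):
    """All distinct serial orderings of the event types of a parallel episode."""
    return sorted(set(permutations(sorted(parallel_episode))))


def is_interesting_parallel(parallel_episode, serial_freq, min_fr):
    """Apply the two conditions to the participants' frequencies."""
    freqs = [serial_freq.get(p, 0.0) for p in participants(parallel_episode)]
    all_frequent = all(f >= min_fr for f in freqs)   # condition 2
    none_frequent = all(f < min_fr for f in freqs)
    # Condition 1: no participant is frequent alone, but their sum reaches the threshold.
    return all_frequent or (none_frequent and sum(freqs) >= min_fr)


print(participants(("A", "B")))
print(is_interesting_parallel(("A", "B"), {("A", "B"): 0.3, ("B", "A"): 0.3}, 0.5))
```

Using a set of permutations keeps the count below $k!$ when an event type is repeated, which is why the text says "up to" $k!$ serial episodes.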

From the point of view of the algorithm, whatever strategy is used will mine serial
episodes, but these serial episodes can refer to parallel ones too. Thereby, the
set of candidate episodes will be made of serial ones, while the set of potential
candidates will be composed of parallel episodes. This means that the algorithm
will be counting the support of serial candidates, as in the previous case; however,
when declaring one of these serial episodes, $\alpha$, frequent or non-frequent, the
notification must go to the parallel episode of which $\alpha$ is a participant.

6 Experiments 

In this section we present the results of running (a probabilistic version of the) 
EpiBF algorithm on a variety of different data collections. 

First, we experimented, as in [6], with protein sequences. We used data in
the PROSITE database of the ExPASy WWW molecular biology server of the
Geneva University Hospital and University of Geneva [10]. The purpose of this
experiment is to identify specific patterns in sequences so as to determine to
which protein family they belong. The sequences in the family we selected
(“DNA mismatch repair proteins I”, PROSITE entry PS00058, the same one
used in [6] for comparison) are known to contain the string GFRGEAL. This string
represents a serial episode of seven consecutive symbols, each separated from the
next by 1 unit of time. Parameter tus was set to 1, and the support threshold was set
to 15, for the 15 individual sequences in the original data. Note that no previous
knowledge of the pattern to be found is involved in this parameter setting.

As expected, we found in the database the pattern GFRGEAL along with 3,755 
more serial episodes (whether maximal or not), most of them much shorter. 
When comparing our approach against previous ones, we see that both Winepi 
and Minepi need to know in advance the length of the expected pattern in the 
protein sequence, in order to fix the window width. However, it is usual that we 
do not know which pattern is to be found in a sequence; so, one must try the 
experiment with different window widths. 

In order to see the flexibility of the unbounded episodes, we also ran
experiments with text data collections. In particular, we used a part of a text extracted
from “Animal Farm” by Orwell [9]. Once again, setting tus close to 1 and
considering each letter a new event, we are able to find frequent prepositions, articles,
suffixes of words, and concatenations of words (such as “to”, “in”, “at”, “ofthe”,
“was”, “her”, “ing”, ...). This experiment could also have been done considering
every new word in the text to be an event; this would lead to unbounded episodes as
a tool to classify other new texts.

When it comes to the general performance of the method, we found that,
naturally, the larger the value of the parameter tus, the more episodes are discovered.
Besides, discovering serial and parallel episodes simultaneously allows the
algorithm to discover parallel patterns when hardly any serial patterns are found in
the database. For example, fed with the first 40,000 digits of the Champernowne
sequence (012345678910111213141516...), with a high frequency threshold of
50% and digits at most 15 positions apart in the episodes (tus = 15), only 3
serial episodes were found, but we discovered 15 other parallel episodes.
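
The input of this last experiment is easy to regenerate. The sketch below builds the digit sequence starting from 0, as written in the text, and turns it into an event sequence with one digit per time unit; the one-digit-per-time-unit encoding and the function name are assumptions.

```python
from itertools import count, islice


def champernowne_digits(n):
    """First n digits of the sequence 0 1 2 ... 9 10 11 ... written in base 10."""
    digits = (d for k in count(0) for d in str(k))
    return "".join(islice(digits, n))


# One event per digit, with the position in the sequence as occurrence time.
events = [(d, t) for t, d in enumerate(champernowne_digits(40_000))]
print(champernowne_digits(24))
```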

7 Conclusions 

We present in this paper a more intuitive approach to interesting episodes. This
proposal overcomes the disadvantages of the widely used previous approaches
(Minepi and Winepi), and it turns out to be an adaptive approach for
categorical time-series data. The algorithmic consequences of the unbounded episodes
are also discussed and implemented. Finally, we have also introduced a new way
of considering parallel episodes as a set of participant serial episodes. The first
experiments are promising, but more experimentation on the different values
of tus and their consequences for the subsequent rules is under way.

Abstract. In this paper we propose an extension of the naive Bayes
classification method to the multi-relational setting. In this setting, training data are
stored in several tables related by foreign key constraints and each example is
represented by a set of related tuples rather than a single row as in the classical
data mining setting. This work is characterized by three aspects. First, an
integrated approach to the computation of the posterior probabilities for each class
that makes use of first-order classification rules. Second, the applicability to both
discrete and continuous attributes by means of supervised discretization. Third,
the consideration of knowledge on the data model embedded in the database
schema during the generation of classification rules. The proposed method has
been implemented in the new system Mr-SBC, which is tightly integrated with
a relational DBMS. Testing has been performed on two datasets and four
benchmark tasks. Results on predictive accuracy and efficiency are in favour of
Mr-SBC for the most complex tasks.



1 Introduction 

Many inductive learning algorithms assume that the training set can be represented as 
a single table, where each row corresponds to an example and each column to a pre- 
dictor variable or to the target variable Y. This assumption, also known as single- 
table assumption [23], seems quite restrictive in some data mining applications, where 
data are stored in a database and are organized into several tables for reasons of effi- 
cient storage and access. In this context, both predictor variables and the target vari- 
able are represented as attributes of distinct tables (relations). 

Although in principle it is possible to consider a single relation reconstructed by
performing a relational join operation on the tables, this approach is fraught with
many difficulties in practice [2,11]. It produces an extremely large table that is
impractical to handle, with lots of data being repeated. A different approach is the construction
of a single central relation that summarizes and/or aggregates information which can
be found in other tables. This approach also has some drawbacks, since information
about how data were originally structured is lost. Consequently, the (multi-)relational
data mining approach has been receiving considerable attention in the literature,
especially for the classification task [1,10,15,20,7].

In the traditional classification setting [18], data are generated independently and
with an identical distribution from an unknown distribution $P$ on some domain $X$ and
are labelled according to an unknown function $g$. The domain of $g$ is spanned by $m$
independent (or predictor) random variables $X_i$ (both numerical and categorical), that
is $X = X_1 \times X_2 \times \ldots \times X_m$, while the range of $g$ is a finite set $Y = \{C_1, C_2, \ldots, C_L\}$,
where each $C_i$ is a distinct class. An inductive learning algorithm takes a training sample
$S = \{(x, y) \in X \times Y \mid y = g(x)\}$ as input and returns a function $f$ which is hopefully close to $g$ on
the domain $X$. A well-known solution is represented by the naive Bayesian classifiers
[3], which classify any $x \in X$ in the class maximizing the posterior
probability $P(C_i \mid x)$ that the observation $x$ is of class $C_i$, that is:

$$f(x) = \arg\max_i P(C_i \mid x)$$

By applying the Bayes theorem, $P(C_i \mid x)$ can be reformulated as follows:

$$P(C_i \mid x) = \frac{P(C_i)\, P(x \mid C_i)}{P(x)}$$

where the term $P(x \mid C_i)$ is in turn estimated by means of the naive Bayes assumption:

$$P(x \mid C_i) = P(x_1, x_2, \ldots, x_m \mid C_i) = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \ldots \times P(x_m \mid C_i)$$

This assumption is clearly false if the predictor variables are statistically
dependent. However, even in this case, the naive Bayesian classifier can give good results
[3].
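
The classification rule above can be illustrated on a single table with categorical attributes. This is a minimal sketch of the classical (propositional) naive Bayes classifier, not of the relational method proposed in the paper; probabilities are estimated by relative frequencies without smoothing, the toy data are made up, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train(samples):
    """samples: list of (x, y) pairs, with x a tuple of categorical values."""
    class_counts = Counter(y for _, y in samples)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for x, y in samples:
        for j, v in enumerate(x):
            value_counts[(y, j)][v] += 1
    return class_counts, value_counts


def classify(x, class_counts, value_counts):
    """Return the class maximizing P(C) * prod_j P(x_j | C); P(x) is constant."""
    n = sum(class_counts.values())

    def score(c):
        p = class_counts[c] / n                               # P(C)
        for j, v in enumerate(x):
            p *= value_counts[(c, j)][v] / class_counts[c]    # P(x_j | C)
        return p

    return max(class_counts, key=score)


samples = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("sunny", "mild"), "yes"),
]
model = train(samples)
print(classify(("rainy", "mild"), *model))
```

The denominator $P(x)$ is dropped because it is the same for every class and does not change the arg max.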

In this paper we present a new approach to the problem of learning classifiers from 
relational data. In particular, we intend to extend the naive Bayes classification to the 
case of relational data. Our proposal is based on the induction of a set of first-order 
classification rules in the context of naive Bayesian classification. 

Studies on first-order naive Bayes classifiers have already been reported in the lit- 
erature. In particular, Pompe and Kononenko [20] proposed a method based on a 
two-step process. The first step uses the ILP-R system [21] to learn a hypothesis in 
the form of a set of first-order rules and then, in the second step, the rules are prob- 
abilistically analyzed. During the classification phase, the conditional probability 
distributions of individual rules are combined naively according to the naive Bayesian 
formula. 

Flach and Lachiche proposed a similar two-step method; however, unlike the previous
one, there is no learning of first-order rules in the first step. Instead, a set
of patterns (first-order conditions) is generated, which are used afterwards as
attributes in a classical attribute-value naive Bayesian classifier [7]. 1BC, the system
implementing this method, views individuals as structured objects and distinguishes between
structural predicates referring to parts of individuals (e.g. atoms within molecules),
and properties applying to the individual or one or several of its parts (e.g. a bond
between two atoms). An elementary first-order feature consists of zero or more
structural predicates and one property.

An evolution of 1BC is represented by the system 1BC2 [16], where no preliminary
generation of first-order conditions is present. Predicates whose probabilities
have to be estimated are dynamically defined on the basis of the individual to classify. 
Therefore, this is a form of lazy learning, which defers processing of its inputs (i.e., 
the estimation of the posterior probability according to the Bayesian statistical frame- 
work) until it receives requests for information (the class of the individual). Computed 
probabilities are discarded at the end of the classification process. Probability esti- 
mates are recursively computed and problems of non-termination in the computation 
may also occur. 

An important aspect of the first two (eager) approaches is that they keep the phases
of first-order rules/conditions generation and of probability estimation separate. In
particular, Pompe and Kononenko use ILP-R to induce first-order rules [21], while
1BC uses TERTIUS [8] to generate first-order features. Then, the probabilities are
computed for each first-order rule or feature. In the classification phase, the two
approaches are similar to a multiple classifier because they combine the results of two
algorithms. However, most first-order features or rules share some literals, and this
approach takes into account the related probabilities more than once. To overcome
this problem it is necessary to rely on an integrated approach, so that the computation
of probabilities on shared literals can be separated from the computation of
probabilities on the remaining literals.

Systems implementing one of the three above approaches work on a set of main- 
memory Prolog facts. In real-world applications, where facts correspond to tuples 
stored on relational databases, some pre-processing is required in order to transform 
tuples into facts. However, this has some disadvantages. First, only part of the original 
hypothesis space implicitly defined by foreign key constraints can be represented after 
some pre-processing. Second, much of the pre-processing may be unnecessary, since 
a part of the hypothesis space described by Prolog facts may never be explored,
perhaps because of early pruning. Third, in applications where data can frequently
change, pre-processing has to be frequently repeated. Finally, database schemas
provide the learning system free of charge with useful knowledge of the data model that
can help to guide the search process. This is an alternative to asking the users to
specify a language bias, such as in 1BC or 1BC2.

A different approach has been proposed by Getoor [13], where Statistical Relational
Models (SRMs) are learnt by taking advantage of a tight integration with a
database. SRMs are models very similar to Bayesian Networks. The main difference
is that the input of an SRM learner is the relational schema of the database and the
tuples of the tables in the relational schema.

In this paper the system Mr-SBC (Multi-Relational Structural Bayesian Classifier) 
is presented. It implements a new learning algorithm based on an integrated approach 
of first-order classification rules with naive Bayesian classification, in order to sepa- 
rate the computation of probabilities of shared literals from the computation of prob- 
abilities for the remaining literals. Moreover, Mr-SBC is tightly integrated with a 
relational database as in the work by Getoor, and handles categorical as well as nu- 
merical data through a discretization method. 

The paper is organized as follows. In the next section the problem is intro- 
duced and defined. The induction of first-order classification rules is presented in 
Section 3, the discretization method is explained in Section 4 and the classifica- 
tion model is illustrated in Section 5. Finally, experimental results are reported in 
Section 6 and some conclusions are drawn. 



2 Problem Statement 

In traditional classification systems that operate on a single relational table, an
observation (or individual) is represented as a tuple of the relational table. Conversely,
in Mr-SBC, which induces first-order classifiers from data stored in a set
$S = \{T_0, T_1, \ldots, T_h\}$ of tables of a relational database, an individual is a tuple t
of a target relation T joined with all the tuples in S which are related to t following a
foreign key path. Formally, a foreign key path is defined as follows:

Def 1. A foreign key path is an ordered sequence of tables
$\vartheta = (T_{i_1}, T_{i_2}, \ldots, T_{i_s})$, where

- $\forall j = 1, \ldots, s$: $T_{i_j} \in S$

- $\forall j = 1, \ldots, s-1$: $T_{i_{j+1}}$ has a foreign key to the table $T_{i_j}$.

In Fig. 1 an example of foreign key paths is reported. In this case,
S = {MOLECULE, ATOM, BOND} and the foreign keys are: A_M_FK, B_M_FK,
A_A_FK1, A_A_FK2. If the target relation T is MOLECULE then five foreign key
paths exist. They are: (MOLECULE), (MOLECULE, ATOM), (MOLECULE, BOND),
(MOLECULE, ATOM, BOND) and (MOLECULE, ATOM, BOND).

The last two are equal as sequences of tables because the bond table has two foreign
keys referencing the table atom.
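
The enumeration of foreign key paths in Def 1 can be sketched as a breadth-first expansion from the target table. This is an illustrative sketch, not the Mr-SBC implementation: the function `foreign_key_paths`, the triple encoding of the schema, and the `max_len` bound are assumptions, and the foreign key names follow one reading of the Fig. 1 example (two distinct keys from BOND to ATOM).

```python
# Each foreign key is (referencing table, foreign key name, referenced table).
# The names below are assumed from the Fig. 1 example.
FOREIGN_KEYS = [
    ("ATOM", "A_M_FK", "MOLECULE"),
    ("BOND", "B_M_FK", "MOLECULE"),
    ("BOND", "A_A_FK1", "ATOM"),
    ("BOND", "A_A_FK2", "ATOM"),
]

def foreign_key_paths(target, foreign_keys, max_len):
    """Enumerate paths (T_i1, ..., T_is) in which each table has a foreign key
    to the previous one. Every step is stored as (table, foreign key used)."""
    start = [(target, None)]
    paths, frontier = [start], [start]
    while frontier:
        next_frontier = []
        for path in frontier:
            if len(path) >= max_len:
                continue  # bound the path length so cyclic schemas terminate
            last_table = path[-1][0]
            for source, fk_name, referenced in foreign_keys:
                if referenced == last_table:
                    next_frontier.append(path + [(source, fk_name)])
        paths.extend(next_frontier)
        frontier = next_frontier
    return paths

paths = foreign_key_paths("MOLECULE", FOREIGN_KEYS, max_len=4)
table_sequences = [tuple(table for table, _ in p) for p in paths]
```

With MOLECULE as the target, the sketch yields five paths, two of which visit the same tables (MOLECULE, ATOM, BOND) through different foreign keys.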

A formal definition of the learning problem solved by Mr-SBC is the following:

Given:

• A training set represented by means of h relational tables $S = \{T_0, T_1, \ldots, T_h\}$ of a
relational database D.

• A set of primary key constraints on tables in S.

• A set of foreign key constraints on tables in S.

• A target relation $T(x_1, \ldots, x_n) \in S$.

• A target discrete attribute y in T, different from the primary key of T.

Find: 

A naive Bayesian classifier which predicts the value of y for some individual repre- 
sented as a tuple in T (with possibly UNKNOWN value for y) and related tuples in S 
according to foreign key paths. 




Fig. 1. An example of a relational representation of training data of the Mutagenesis database.

3 Generation of First-Order Rules

Let R' be a set of first-order classification rules for the classes $\{C_1, C_2, \ldots, C_L\}$,
and I an individual to be classified and defined as above. The individual can be logically
represented as a set of ground facts, the only exception being the fact associated to the
target relation T, where the argument corresponding to the target attribute y is a
variable Y. A rule $R_j \in R'$ covers I if a substitution $\theta$ exists such that
$R_j\theta \subseteq I\theta$. The application of the substitution to I is required to ground
the only variable Y in I to the same constant as that reported in $R_j$ for the target
attribute. Let R be the subset of rules in R' that cover I, that is
$R = \{R_j \in R' \mid R_j \text{ covers } I\}$. The first-order naive Bayes classifier for
the individual is defined as follows:

$$f(I) = \arg\max_i P(C_i|R) = \arg\max_i \frac{P(C_i)\,P(R|C_i)}{P(R)}$$

The value $P(C_i)$ is the prior probability of the class $C_i$. Since P(R) is independent of
the class $C_i$, it does not affect f(I), that is,

$$f(I) = \arg\max_i P(C_i)\,P(R|C_i) \qquad (1)$$

The computation of $P(R|C_i)$ depends on the structure of R. Therefore, it is important
to clarify how first-order rules are built in order to associate them with a
probability measure. As already pointed out, Pompe and Kononenko use the first-order
learning system ILP-R to induce the set of rules R’. This approach is very expensive 
and does not take into account the bias automatically determined by the constraints in 
the database. On the other hand, Flach and Lachiche use Tertius to determine the 
structure of first-order features on the basis of the structure of the individuals. The 
system Tertius deals with learning first-order logic rules from data lacking an explicit 
classification predicate. Consequently, the learned rules are not restricted to predicate 
definitions as in supervised inductive logic programming. Our solution is similar to 
that proposed by Flach since the structure of classification rules is determined on the 
basis of the structure of the individuals. The main difference is that the classification 
predicate is considered during the generation of the rules. 

All predicates in classification rules generated by Mr-SBC are binary and can be of
two different types.

Def 2. A binary predicate p is a structural predicate associated to a table $T_i \in S$ if a
foreign key FK in $T_i$ exists that references a table $T_{i_1} \in S$. The first argument of p
represents the primary key of $T_{i_1}$ and the second argument represents the primary key
of $T_i$.

Def 3. A binary predicate p is a property predicate associated to a table $T_i \in S$ if the
first argument of p represents the primary key of $T_i$ and the second argument represents
another attribute in $T_i$ which is neither the primary key of $T_i$ nor a foreign key in
$T_i$.

Def 4. A first-order classification rule associated to the foreign key path $\vartheta$ is a
clause in the form:

$$p_0(A_1, y) \;\text{:-}\; p_1(A_1, A_2),\; p_2(A_2, A_3),\; \ldots,\; p_{s-1}(A_{s-1}, A_s),\; p_s(A_s, c).$$

where

1. $p_0$ is a property predicate associated to the target table T and to the target
attribute y.

2. $\vartheta = (T_{i_1}, \ldots, T_{i_s})$ is a foreign key path such that for each
$k = 1, \ldots, s-1$, $p_k$ is a structural predicate associated to the table $T_{i_{k+1}}$.

3. $p_s$ is a property predicate associated to the table $T_{i_s}$.

An example of a first-order rule is the following:

molecule_Label(A, active) :- molecule_Atom(A,B), atom_Type(B,'[22..27]').
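
The coverage test used to select the rules that apply to an individual can be illustrated with a small backtracking matcher over ground facts. This is a sketch under simplifying assumptions: facts and body literals are encoded as hypothetical `(predicate, arg1, arg2)` triples, terms starting with an uppercase letter are treated as variables (Prolog convention), and the head of the rule is handled by pre-binding its variable to the individual's key rather than by unifying the class constant.

```python
def unify(term, value, binding):
    """Bind a variable (uppercase initial) to value, or compare a constant."""
    if term[0].isupper():
        if term in binding:
            return binding[term] == value
        binding[term] = value
        return True
    return term == value

def covers(body, facts, binding):
    """True if a substitution extending `binding` maps every body literal
    onto some ground fact of the individual."""
    if not body:
        return True
    predicate, first, second = body[0]
    for fact_predicate, fact_first, fact_second in facts:
        if fact_predicate != predicate:
            continue
        candidate = dict(binding)  # copy, so a failed attempt leaves no bindings
        if unify(first, fact_first, candidate) and unify(second, fact_second, candidate):
            if covers(body[1:], facts, candidate):
                return True
    return False

# Hypothetical individual: molecule m1 with two atoms.
facts = {
    ("molecule_Atom", "m1", "a1"), ("atom_Type", "a1", "[22..27]"),
    ("molecule_Atom", "m1", "a2"), ("atom_Type", "a2", "[1..21]"),
}
rule_body = [("molecule_Atom", "A", "B"), ("atom_Type", "B", "[22..27]")]
other_body = [("molecule_Atom", "A", "B"), ("atom_Type", "B", "[30..40]")]
```

Calling `covers(rule_body, facts, {"A": "m1"})` succeeds through atom a1, while `other_body` fails because no atom of m1 has the requested type.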

Mr-SBC searches all possible classification rules by means of a breadth-first strat- 
egy and iterates over some refining steps. A refining step is biased by the possible 
foreign key paths and consists of the addition of a new literal, the unification of two 
variables and, in the case of a property predicate, in the instantiation of a variable. 
The search strategy is biased by the structure of the database because each refining 
step is made only if the generated first-order classification rule can be associated to a 
foreign key path. However, the number of refinement steps is upper bounded by a 
user-defined constant MAX_LEN_PATH. 



4 Discretization 

In Mr-SBC continuous attributes are handled through supervised discretization. Su- 
pervised discretization methods utilize the information on the class labels of individu- 
als to partition a numerical interval into bins. The proposed algorithm sorts the ob- 
served values of a continuous feature and attempts to greedily divide the domain of 
the continuous variable into bins, such that each bin contains only instances of one 
class. Since such a scheme could possibly lead to one bin for each observed real 
value, the algorithm is constrained to merge bins in a second step. Merging of two 
contiguous bins is performed when the increase of entropy is lower than a user- 
defined threshold (MAX_GAIN). This method is a variant of the one-step method 
1RD by Holte [14] for the induction of one-level decision trees, that proved to work
well with the Naive Bayes Classifier [4]. It is also different from the one-step method 
by Fayyad and Irani [6] that recursively splits the initial interval according to the class 
information entropy measure until a stopping criterion based on the Minimum De- 
scription Length (MDL) principle is verified. 
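
The two-step scheme described above (pure bins first, then merging of contiguous bins) might be sketched as follows. The text does not spell out how the "increase of entropy" is measured, so this sketch assumes it is the class entropy of the merged bin minus the size-weighted entropy of the two bins; the function names and the handling of repeated values are likewise assumptions.

```python
import math
from collections import Counter

def entropy(counts):
    """Class entropy (in bits) of a bin described by its class counts."""
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def discretize(values, labels, max_gain):
    """Return a list of (low, high) bins for one continuous attribute."""
    # Step 1: greedily build bins that contain instances of a single class.
    bins = []  # each bin is [low, high, Counter of class labels]
    for value, label in sorted(zip(values, labels)):
        if bins and set(bins[-1][2]) == {label}:
            bins[-1][1] = value
            bins[-1][2][label] += 1
        else:
            bins.append([value, value, Counter({label: 1})])
    # Step 2: merge contiguous bins while the entropy increase stays below max_gain.
    merged = True
    while merged and len(bins) > 1:
        merged = False
        for i in range(len(bins) - 1):
            left, right = bins[i], bins[i + 1]
            joint = left[2] + right[2]
            n_left, n_right = sum(left[2].values()), sum(right[2].values())
            weighted = (n_left * entropy(left[2]) + n_right * entropy(right[2])) / (n_left + n_right)
            if entropy(joint) - weighted < max_gain:
                bins[i:i + 2] = [[left[0], right[1], joint]]
                merged = True
                break
    return [(low, high) for low, high, _ in bins]

values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
```

On this toy attribute the two pure bins are kept with a threshold of 0.5, since merging them would raise the entropy by one bit; a threshold above 1 would collapse them into a single bin.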



5 The Computation of Probabilities 

According to the naive Bayes assumption, the attributes are considered independent.
However, this assumption is clearly false for the attributes that are primary keys or
foreign keys. This means that the computation of $P(R|C_i)$ in equation (1) depends on
the structures of rules in R. For instance, if $R_1$ and $R_2$ are two rules of class $C_i$
that share the same structure and differ only for the property predicates in their bodies

$$R_1:\; p_{1,0} \;\text{:-}\; p_{1,1}, \ldots, p_{1,K_1-1}, p_{1,K_1}$$

$$R_2:\; p_{2,0} \;\text{:-}\; p_{2,1}, \ldots, p_{2,K_2-1}, p_{2,K_2}$$

where $K_1 = K_2$ and $p_{1,1} = p_{2,1},\; p_{1,2} = p_{2,2},\; \ldots,\; p_{1,K_1-1} = p_{2,K_2-1}$,
then

$$P(\{R_1, R_2\}|C_i) = P(p_{1,0} \wedge (p_{1,1}, \ldots, p_{1,K_1-1}) \wedge p_{1,K_1} \wedge p_{2,K_2} \mid C_i)$$

$$= P(p_{1,0} \wedge (p_{1,1}, \ldots, p_{1,K_1-1}) \mid C_i) \cdot P(p_{1,K_1} \wedge p_{2,K_2} \mid p_{1,0} \wedge (p_{1,1}, \ldots, p_{1,K_1-1}) \wedge C_i)$$

The first term takes into account the structure common to both rules while the second
term refers to the conditional probability of satisfying the property predicates in the
rules given the common structure.

The latter probability can be factorized under the naive Bayes assumption, that is:

$$P(p_{1,K_1} \wedge p_{2,K_2} \mid p_{1,0} \wedge (p_{1,1}, \ldots, p_{1,K_1-1}) \wedge C_i) =$$

$$P(p_{1,K_1} \mid p_{1,0} \wedge (p_{1,1}, \ldots, p_{1,K_1-1}) \wedge C_i) \cdot P(p_{2,K_2} \mid p_{1,0} \wedge (p_{1,1}, \ldots, p_{1,K_1-1}) \wedge C_i)$$

According to this approach the conditional probability of the structure is computed
only once. This approach differs from that proposed in the works of Pompe and
Kononenko [20] and Flach [7], where the factorization would multiply the structure
probability twice.

By generalizing to a set of classification rules we have:

$$P(C_i)\,P(R|C_i) = P(C_i)\,P(structure) \prod_j P(R_j|structure) \qquad (2)$$

where the term structure takes into account the class $C_i$ and the structural parts of the
rules in R.

If the classification rule $R_j \in R$ is in the form
$p_{j,0} \;\text{:-}\; p_{j,1}, \ldots, p_{j,K_j-1}, p_{j,K_j}$, where $p_{j,0}$ and $p_{j,K_j}$ are
property predicates and $p_{j,1}, p_{j,2}, \ldots, p_{j,K_j-1}$ are structural predicates, then:

$$P(R_j|structure) = P(p_{j,K_j} \mid p_{j,0}, p_{j,1}, \ldots, p_{j,K_j-1}) = P(p_{j,K_j} \mid C_i, p_{j,1}, \ldots, p_{j,K_j-1})$$

where $C_i$ is the value of the target attribute in the head of the clause ($p_{j,0}$). To
compute this probability, we use the Laplace estimation:

$$P(p_{j,K_j} \mid C_i, p_{j,1}, \ldots, p_{j,K_j-1}) = \frac{\#(p_{j,K_j}, C_i, p_{j,1}, \ldots, p_{j,K_j-1}) + 1}{\#(C_i, p_{j,1}, \ldots, p_{j,K_j-1}) + F}$$

where F is the number of possible values of the attribute to which the $p_{j,K_j}$ property
predicate is associated. Laplace's estimate is used in order to avoid null probabilities
in equation (2). In practice, the value at the numerator is the number of individuals
which satisfy the conjunction $p_{j,K_j}, C_i, p_{j,1}, \ldots, p_{j,K_j-1}$, in other words, the
number of individuals covered by the rule
$p_{j,0} \;\text{:-}\; p_{j,1}, \ldots, p_{j,K_j-1}, p_{j,K_j}$, which is determined by a
"select count(*)" SQL instruction. The value of the denominator is the number of
individuals covered by the rule $p_{j,0} \;\text{:-}\; p_{j,1}, \ldots, p_{j,K_j-1}$.
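
The Laplace estimate above can be illustrated against an in-memory SQLite database. The schema, column names and data below are invented for the example, and counting distinct molecule keys is an assumption about what "number of individuals" means when one molecule has several matching atoms; the actual queries issued by Mr-SBC are not given in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE molecule (id TEXT PRIMARY KEY, label TEXT);
CREATE TABLE atom (id TEXT PRIMARY KEY,
                   mol_id TEXT REFERENCES molecule(id),
                   type TEXT);
INSERT INTO molecule VALUES ('m1', 'active'), ('m2', 'active'), ('m3', 'inactive');
INSERT INTO atom VALUES ('a1', 'm1', '[22..27]'), ('a2', 'm1', '[1..21]'),
                        ('a3', 'm2', '[1..21]'),  ('a4', 'm3', '[22..27]');
""")

def count(sql, params):
    return conn.execute(sql, params).fetchone()[0]

def laplace_estimate(label, atom_type, n_values):
    """Estimate P(atom_Type(B, atom_type) | label, molecule_Atom(A, B))."""
    # Numerator: individuals covered by
    #   molecule_Label(A, label) :- molecule_Atom(A, B), atom_Type(B, atom_type).
    covered = count(
        "SELECT COUNT(DISTINCT m.id) FROM molecule m "
        "JOIN atom a ON a.mol_id = m.id WHERE m.label = ? AND a.type = ?",
        (label, atom_type))
    # Denominator: individuals covered by the structural part only.
    structure = count(
        "SELECT COUNT(DISTINCT m.id) FROM molecule m "
        "JOIN atom a ON a.mol_id = m.id WHERE m.label = ?",
        (label,))
    return (covered + 1) / (structure + n_values)

p_active = laplace_estimate("active", "[22..27]", n_values=2)
p_inactive = laplace_estimate("inactive", "[22..27]", n_values=2)
```

Here F is passed as `n_values`; with two discretized atom types, one of the two active molecules is covered, giving (1 + 1) / (2 + 2).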

The term P(structure) in equation (2) is computed as follows. Let
$B = \{(p_{j,1}, p_{j,2}, \ldots, p_{j,t}) \mid j = 1, \ldots, s \text{ and } t = 1, \ldots, K_j - 1\}$ be
the set of all distinct sequences of structural predicates in the rules of R. Then

$$P(structure) = \prod_{seq \in B} P(seq) \qquad (3)$$

To compute P(seq) it is necessary to introduce the definition of the probability JP
that a join query is satisfied; for this purpose, the formulation provided in [11] can be
useful. Let $\vartheta = (T_{i_1}, \ldots, T_{i_s})$ be a foreign key path, then:

$$JP(\vartheta) = JP(T_{i_1}, \ldots, T_{i_s}) = \frac{|\bowtie(T_{i_1} \times \cdots \times T_{i_s})|}{|T_{i_1}| \times \cdots \times |T_{i_s}|}$$

where $\bowtie(T_{i_1} \times \cdots \times T_{i_s})$ is the result of the join between the tables
$T_{i_1}, \ldots, T_{i_s}$.
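
The join probability JP can be illustrated with plain Python lists standing in for tables. This is a sketch only: the row encoding, the `links` argument naming the foreign key and primary key columns, and the toy data are assumptions, and a real system would obtain the join size from the database rather than by nested loops.

```python
from math import prod

def join_probability(tables, links):
    """tables: list of tables (lists of dict rows) ordered as the foreign key path.
    links[k] = (foreign key column in tables[k + 1], primary key column in tables[k]).
    Returns |join| / (|T_1| * ... * |T_s|)."""
    partial = [(row,) for row in tables[0]]
    for k, (fk_column, pk_column) in enumerate(links):
        partial = [rows + (nxt,)
                   for rows in partial
                   for nxt in tables[k + 1]
                   if nxt[fk_column] == rows[-1][pk_column]]
    return len(partial) / prod(len(table) for table in tables)

# Hypothetical tables: two molecules, three atoms.
molecule = [{"id": "m1"}, {"id": "m2"}]
atom = [{"id": "a1", "mol_id": "m1"},
        {"id": "a2", "mol_id": "m1"},
        {"id": "a3", "mol_id": "m2"}]
jp = join_probability([molecule, atom], [("mol_id", "id")])
```

Three of the six pairs in the Cartesian product survive the join, so the value is 0.5 for this toy schema.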

We must remember that each sequence seq is associated to a foreign key path $\vartheta$. If
$seq = (p_{j,1}, p_{j,2}, \ldots, p_{j,t})$ there are two possibilities: either a prefix of seq is
in B or not. By denoting as $T_{j_h}$ the table related to $p_{j,h}$, $h = 1, \ldots, t$, the
probability P(seq) can be recursively defined as follows:

$$P(seq) = \begin{cases} JP(T_{j_1}, \ldots, T_{j_t}) & \text{if seq has no prefix in } B \\[4pt] \dfrac{JP(T_{j_1}, \ldots, T_{j_t})}{P(seq')} & \text{if } seq' \text{ is the longest prefix of seq in } B \end{cases}$$

This formulation is necessary in order to compute formula (3) considering both
dependent and independent events. Since P(structure) takes into account the class,
P(seq) is computed separately for each class.



6 Experimental Results 

Mr-SBC has been implemented as a module of the system MURENA and has been
empirically evaluated on the Mutagenesis and Biodegradability datasets.



6.1 Results on Mutagenesis 

These datasets, taken from the MLNET repository, concern the problem of identifying 
the mutagenic compounds [19] and have been extensively used to test both inductive 
logic programming (ILP) systems and (multi-)relational mining systems. We consid- 
ered, analogously to related experiments in the literature, the “regression friendly” 
dataset of 188 elements. 

A recent study on this database [22] recognizes five levels of background knowl- 
edge for mutagenesis which can provide richer descriptions of the examples. In this 
study we used only the first three levels of background knowledge in order to com- 
pare the performance of Mr-SBC with other methods for which experimental results 
are available in the literature. Table 1 shows the first three sets of background
knowledge used in our experiments, where $BK_i \subseteq BK_{i+1}$ for i=0, ..., 2. The greater
the BK, the more complex the learning problem.

The dataset is analyzed by means of a 10-fold cross-validation, that is, the target 
table is first divided into ten blocks of near-equal size and distribution of class values, 
and then, for every block, a subset of tuples in S related to the tuples in the target table 
block are extracted. In this way, ten databases are created. Mr-SBC is trained on nine 
databases and tested on the hold-out database. Mr-SBC has been executed with the 
following parameters: MAX_LEN_PATH=4 and MAX_GAIN= 0.5. 
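
The first step of this protocol, dividing the target table into ten blocks of near-equal size and class distribution, can be sketched as a stratified round-robin assignment. The function name, the fixed random seed and the toy labels are assumptions; extracting the related tuples of S for each block is not shown.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """labels: dict mapping the primary key of a target tuple to its class.
    Returns k lists of keys with near-equal size and class distribution."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for key, label in labels.items():
        by_class[label].append(key)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        keys = sorted(by_class[label])
        rng.shuffle(keys)  # random order within each class
        for key in keys:
            folds[position % k].append(key)  # deal keys round-robin
            position += 1
    return folds

# Hypothetical target table with 13 active and 7 inactive molecules.
toy_labels = {f"m{i}": ("active" if i < 13 else "inactive") for i in range(20)}
folds = stratified_folds(toy_labels, k=10)
```

Each fold then serves once as the hold-out block, with the remaining nine used for training.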







Table 1. Background knowledge for Mutagenesis database.

Background   Description
----------   -----------
BK_0         Consists of those data obtained with the molecular modelling package
             QUANTA. For each compound it obtains the atoms, bonds, bond types,
             atom types, and partial charges on atoms.
BK_1         Consists of definitions in BK_0 plus indicators ind1 and inda in the
             molecule table.
BK_2         Variables (attributes) logp and lumo are added to definitions in BK_1.



Table 2. Accuracy comparison on the set of 188 regression friendly elements of Mutagenesis.
Results for Progol_2, Foil, Tilde are taken from [1]. Results for Progol_1 are taken from [22].
The results for 1BC are taken from [9]. Results for 1BC2 are taken from [16]. Results for
MRDTL are taken from [17]. The values are the results of 10-fold cross-validation.

System       Accuracy (%)
             BK_0     BK_1     BK_2
---------    -----    -----    -----
Progol_1     79       86       86
Progol_2     76       81       86
Foil         61       61       83
Tilde        75       79       85
MRDTL        67       87       88
1BC2         72.9     -        72.9
1BC          80.3     -        87.2
Mr-SBC       76.5     81       89.9



Experimental results on predictive accuracy are reported in Table 2 for increasing
complexity of the models. A comparison to other results reported in the literature is
also made. Mr-SBC has the best performance for the most complex task (BK_2) with
an accuracy of almost 90%, while it ranks third for the simplest task. Interestingly, the
predictive accuracy increases with the complexity of the background knowledge,
which means that the variables added in BK_1 and BK_2 are meaningful and Mr-SBC
takes advantage of that.

As regards execution time (see Table 3), the time required by Mr-SBC increases
with the complexity of the background knowledge. Mr-SBC is generally considerably
faster than competing systems, such as Progol, Foil, Tilde and 1BC, that do not
operate on data stored in a database. Moreover, except for the task BK_0, Mr-SBC performs
better than MRDTL, which works on a database. It is noteworthy that the trade-off
between accuracy and complexity is in favour of Mr-SBC.

The average number of extracted rules for each fold is quite high (55.9 for BK_0,
59.9 for BK_1, and 64.8 for BK_2). Some rules are either redundant or cover very few
individuals. Therefore, some additional stopping criteria are required to avoid the
generation of these rules and to further reduce the computational cost of the algorithm.



6.2 Results on Biodegradability 

The Biodegradability dataset has already been used in the literature for both
regression and classification tasks [5]. It consists of 328 structural chemical molecules
described in terms of atom and bond. The target variable for machine learning systems
is the natural logarithm of the arithmetic mean of the low and high estimate of the
HTL (Half-Life Time) for aqueous biodegradation in aerobic conditions, measured
in hours. We use a discretized version in order to apply classification systems to the
problem. As in [5], four classes have been defined: chemicals degrade fast, moderately,
slowly or are resistant.



Table 3. Time comparison on the set of 188 regression friendly elements of Mutagenesis.
Results for Progol_2, Foil, Tilde are taken from [1]. Results for Progol_1 are taken from [22].
Results for MRDTL are taken from [17]. The results of Mr-SBC have been taken on a PIII
Win2k platform.

System       Time (secs)
             BK_0      BK_1      BK_2
---------    ------    ------    ------
Progol_1     8695      4627      4974
Progol_2     117000    64000     42000
Foil         4950      9138      0.5
Tilde        41        170       142
MRDTL        0.85      170       142
1BC2         -         -         -
1BC          -         -         -
Mr-SBC       36        42        48



Table 4. Accuracy comparison on the set of 328 chemical molecules of Biodegradability.
Results for Mr-SBC and Tilde are reported.

Fold       Mr-SBC     Tilde Pruned
-------    -------    ------------
0          0.90909    0.69697
1          0.87878    0.81818
2          0.84848    0.90909
3          0.87878    0.87879
4          0.78788    0.69697
5          0.84848    0.90909
6          0.90625    0.90625
7          0.87879    0.81818
8          0.87500    0.93750
9          0.93939    0.72727
Average    0.87509    0.82983



The dataset is analyzed by means of a 10-fold cross-validation. In each run,
Mr-SBC and Tilde are trained on nine databases and tested on the hold-out database.
Mr-SBC has been executed with the following parameters: MAX_LEN_PATH=4 and
MAX_GAIN=0.5. Experimental results on predictive accuracy are reported in Table
4. Averaged over the folds, the accuracy is in favour of Mr-SBC.

7 Conclusions 



In the paper, a multi-relational data mining system with a tight integration to a
relational DBMS is described. It is based on the induction of a set of first-order
classification rules in the context of naive Bayesian classification. It presents several
differences with respect to related works. First, it is based on an integrated approach, so
that the contribution of literals shared by several rules to the posterior probability is
computed only once. Second, it works both on discrete and continuous attributes.
Third, the generation of rules is based on the knowledge of a data model embedded in
the database schema. The proposed method has been implemented in the new system
Mr-SBC and tested on four benchmark tasks. Results on predictive accuracy are in
favour of our system for the most complex tasks. Mr-SBC also proved to be efficient.

As future work, we plan to extend the comparison of Mr-SBC to other multi- 
relational data mining systems on a larger set of benchmark datasets. Moreover, we 
intend to frame the proposed method in a transduction inference setting, where both 
labelled and unlabelled data are available for training. Finally we intend to integrate 
Mr-SBC in a document processing system that makes extensive use of machine 
learning tools to reach a high adaptivity to different tasks.

Abstract. Many real world data mining applications involve learning from imbalanced
data sets. Learning from data sets that contain very few instances of
the minority (or interesting) class usually produces biased classifiers that have a 
higher predictive accuracy over the majority class(es), but poorer predictive ac- 
curacy over the minority class. SMOTE (Synthetic Minority Over-sampling 
TEchnique) is specifically designed for learning from imbalanced data sets. 
This paper presents a novel approach for learning from imbalanced data sets, 
based on a combination of the SMOTE algorithm and the boosting procedure. 
Unlike standard boosting where all misclassified examples are given equal 
weights, SMOTEBoost creates synthetic examples from the rare or minority 
class, thus indirectly changing the updating weights and compensating for 
skewed distributions. SMOTEBoost applied to several highly and moderately 
imbalanced data sets shows improvement in prediction performance on the mi- 
nority class and overall improved F-values. 



1 Motivation and Introduction 

Rare events are events that occur very infrequently, i.e. whose frequency ranges from 
say 5% to less than 0.1%, depending on the application. Classification of rare events 
is a common problem in many domains, such as detecting fraudulent transactions, 
network intrusion detection, Web mining, direct marketing, and medical diagnostics. 
For example, in the network intrusion detection domain, the number of intrusions on 
the network is typically a very small fraction of the total network traffic. In medical 
databases, when classifying the pixels in mammogram images as cancerous or not
[1], abnormal (cancerous) pixels represent only a very small fraction of the entire
image. The nature of the application requires a fairly high detection rate of the minor- 
ity class and allows for a small error rate in the majority class since the cost of mis- 
classifying a cancerous patient as non-cancerous can be very high. 

In all these scenarios when the majority class typically represents 98-99% of the 
entire population, a trivial classifier that labels everything with the majority class can 
achieve high accuracy. It is apparent that for domains with imbalanced and/or skewed 
distributions, classification accuracy is not sufficient as a standard performance meas- 
ure. ROC analysis [2] and metrics such as precision, recall and F-value [3, 4] have 
been used to understand the performance of the learning algorithm on the minority 
class. The prevalence of class imbalance in various scenarios has caused a surge in 
research dealing with the minority classes. Several approaches for dealing with 
imbalanced data sets were recently introduced [1, 2, 4, 9-15]. 

A confusion matrix as shown in Table 1 is typically used to evaluate performance 
of a machine learning algorithm for rare class problems. In classification problems, 
assuming class “C” as the minority class of the interest, and “NC” as a conjunction of 
all the other classes, there are four possible outcomes when detecting class “C”. 



Table 1. Confusion matrix defines four possible scenarios when classifying class “C”

                     Predicted Class “C”     Predicted Class “NC”
-----------------    --------------------    --------------------
Actual class “C”     True Positives (TP)     False Negatives (FN)
Actual class “NC”    False Positives (FP)    True Negatives (TN)



From Table 1, recall, precision and F-value may be defined as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$F\text{-value} = \frac{(1 + \beta^2) \cdot \text{Recall} \cdot \text{Precision}}{\beta^2 \cdot \text{Recall} + \text{Precision}}$$

where $\beta$ corresponds to relative importance of precision vs. recall and it is usually set
to 1. The main focus of all learning algorithms is to improve the recall, without
sacrificing the precision. However, the recall and precision goals are often conflicting and
attacking them simultaneously may not work well, especially when one class is rare. 
The F-value incorporates both precision and recall, and the “goodness” of a learning 
algorithm for the minority class can be measured by the F-value. While ROC curves 
represent the trade-off between values of TP and FP, the F-value basically incorpo- 
rates the relative effects/costs of recall and precision into a single number. 
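
As a small worked illustration of these definitions, the three measures can be computed directly from the confusion-matrix counts. The function name and the convention of returning 0 when a denominator is zero are assumptions of this sketch.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F-value from confusion-matrix counts.
    Undefined ratios (zero denominators) are reported as 0.0 by convention."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = beta ** 2 * recall + precision
    f_value = (1 + beta ** 2) * recall * precision / denominator if denominator else 0.0
    return precision, recall, f_value

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f_value = precision_recall_f(tp=8, fp=2, fn=8)
```

With these counts the precision is 0.8, the recall is 0.5, and the F-value (for beta = 1) is 0.8 / 1.3, roughly 0.615, which shows how a low recall pulls the combined score down even when precision is high.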

It is well known in machine learning that a combination of classifiers can be an ef- 
fective technique for improving prediction accuracy. As one of the most popular 
combining techniques, boosting [5] uses adaptive sampling of instances to generate a 
highly accurate ensemble of classifiers whose individual global accuracy is only 
moderate. There has been significant interest in the recent literature for embedding 
cost-sensitivity in the boosting algorithm. CSB [6] and AdaCost boosting algorithms 
[7] update the weights of examples according to the misclassification costs.
Karakoulas and Shawe-Taylor’s ThetaBoost adjusts the margins in the presence of unequal
loss functions [8]. Alternatively, Rare-Boost [4, 9] updates the weights of the exam- 
ples differently for all four entries shown in Table 1. 

In this paper we propose a novel approach for learning from imbalanced data sets, 
SMOTEBoost, that embeds SMOTE [1], a technique for countering imbalance in a 
dataset, in the boosting procedure. We apply SMOTE during each boosting iteration 
in order to create new synthetic examples from the minority class. SMOTEBoost
focuses on the minority class examples sampled for each boosting iteration and
constructs new examples from them. Experiments performed on data sets from several
domains have shown that SMOTEBoost is able to achieve a higher F-value than
SMOTE applied to a classifier, the standard boosting algorithm, AdaCost [7], and
first SMOTE then boosting, for each of the datasets. We also provide a precision-recall
analysis of the approaches. 



2 Synthetic Minority Oversampling Technique - SMOTE 

SMOTE (Synthetic Minority Oversampling Technique) was proposed to counter the 
effect of having few instances of the minority class in a data set [1]. SMOTE creates 
synthetic instances of the minority class by operating in the “feature space” rather 
than the “data space”. By synthetically generating more instances of the minority 
class, the inductive learners, such as decision trees (e.g. C4.5 [16]) or rule-learners 
(e.g. RIPPER [17]), are able to broaden their decision regions for the minority class. 
We deal with nominal (or discrete) and continuous attributes differently in SMOTE. 
In the nearest neighbor computations for the minority classes we use Euclidean dis- 
tance for the continuous features and the Value Distance Metric (with the Euclidean 
assumption) for the nominal features [1, 18, 19]. The new synthetic minority samples 
are created as follows: 

• For the continuous features 

o Take the difference between a feature vector (minority class sample) and one 
of its k nearest neighbors (minority class samples). 

o Multiply this difference by a random number between 0 and 1.

o Add the scaled difference to the feature value of the original feature vector, thus
creating a new feature vector.

• For the nominal features 

o Take majority vote between the feature vector under consideration and its k 
nearest neighbors for the nominal feature value. In the case of a tie, choose 
at random. 

o Assign that value to the new synthetic minority class sample. 

Using this technique, a new minority class sample is created in the neighborhood 
of the minority class sample under consideration. The neighbors are proportionately 
utilized depending upon the amount of SMOTE. Hence, using SMOTE, more general 
regions are learned for the minority class, allowing the classifiers to better predict
unseen examples belonging to the minority class. A combination of SMOTE and
under-sampling creates potentially optimal classifiers as a majority of points from the
SMOTE and under-sampling combination lie on the convex hull of the family of
ROC curves [1, 2].



3 SMOTEBoost Algorithm 

In this paper, we propose a SMOTEBoost algorithm that combines the Synthetic 
Minority Oversampling Technique (SMOTE) and the standard boosting procedure. 
We want to utilize SMOTE for improving the prediction of the minority classes, and 
we want to utilize boosting to not sacrifice accuracy over the entire data set. Our goal 
is to better model the minority class in the data set, by providing the learner not only 
with the minority class instances that were misclassified in previous boosting itera- 
tions, but also with a broader representation of those instances. We want to improve 
the overall accuracy of the ensemble by focusing on the difficult minority (positive) 
class cases, as we want to model this class better, with minimal accuracy degradation 
for the majority class. The goal is to improve our True Positives (TP).

The standard boosting procedure gives equal weights to all misclassified examples.
Since the boosting algorithm samples from a pool of data that predominantly consists
of the majority class, subsequent samplings of the training set may still be skewed
towards the majority class. Although boosting reduces the variance and the bias in the
final ensemble, it might not be as effective for data sets with skewed class
distributions. The boosting algorithm (AdaBoost) treats both kinds of errors (FP and
FN) in a similar fashion, and therefore sampling distributions in subsequent boosting
iterations could have a larger composition of majority class cases.

Our goal is to reduce the bias inherent in the learning procedure due to the class
imbalance. Introducing SMOTE in each round of boosting will enable each learner to
learn from more of the minority class cases, thus learning broader decision regions
for the minority class. We apply SMOTE only to the minority class examples in the
distribution $D_t$ at iteration t. This has the implicit effect of increasing the sampling
weights of minority class cases, as new examples are created in $D_t$. The synthetically
created minority class cases are discarded after learning a classifier at iteration t. That
is, they are not added to the original training set, and new examples are constructed in
each iteration t by sampling from $D_t$. The error estimate after each boosting iteration
is computed on the original training set. Thus, we try to maximize the margin for the
skewed class data set by adding new minority class cases before learning a classifier in a
boosting iteration. We also conjecture that introducing the SMOTE procedure
increases the diversity amongst the classifiers in the ensemble, as in each iteration we
produce a different set of synthetic examples, and therefore different classifiers. The
amount of SMOTE is a parameter that can vary for each data set. It would be useful to
know a priori the amount of SMOTE to be introduced for each data set. We believe
that utilizing a validation set to set the amount of SMOTE before the boosting
iterations can be useful.

The combination of SMOTE and the boosting procedure that we present here is a
variant of the AdaBoost.M2 procedure [5]. The proposed SMOTEBoost algorithm,
shown in Fig. 1, proceeds in a series of T rounds. In every round a weak learning
algorithm is called and presented with a different distribution $D_t$, altered by
emphasizing particular training examples. The distribution is updated to give wrong
classifications higher weights than correct classifications. Unlike standard boosting,
where the distribution $D_t$ is updated uniformly for examples from both the majority
and minority classes, in the SMOTEBoost technique the distribution is updated such
that the examples from the minority class are oversampled by creating synthetic
minority class examples (see Line 1, Fig. 1). The entire weighted training set is given
to the weak learner to compute the weak hypothesis $h_t$. At the end, the different
hypotheses are combined into a final hypothesis $h_{fn}$.



• Given: Set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, $x_i \in X$, with labels $y_i \in Y = \{1, \ldots, C\}$,
where $C_p$ ($C_p < C$) corresponds to a minority (positive) class.

• Let $B = \{(i, y) : i = 1, \ldots, m,\ y \neq y_i\}$

• Initialize the distribution $D_1$ over the examples, such that $D_1(i) = 1/m$.

• For $t = 1, 2, 3, 4, \ldots, T$

1. Modify distribution $D_t$ by creating N synthetic examples from minority class
$C_p$ using the SMOTE algorithm

2. Train a weak learner using distribution $D_t$

3. Compute weak hypothesis $h_t : X \times Y \rightarrow [0, 1]$

4. Compute the pseudo-loss of hypothesis $h_t$:

$\varepsilon_t = \sum_{(i,y) \in B} D_t(i, y)\,(1 - h_t(x_i, y_i) + h_t(x_i, y))$

5. Set $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$ and $w_t = (1/2) \cdot (1 - h_t(x_i, y) + h_t(x_i, y_i))$

6. Update $D_t$: $D_{t+1}(i, y) = (D_t(i, y) / Z_t) \cdot \beta_t^{w_t}$

where $Z_t$ is a normalization constant chosen such that $D_{t+1}$ is a distribution.

• Output the final hypothesis: $h_{fn} = \arg\max_{y \in Y} \sum_{t=1}^{T} (\log 1/\beta_t) \cdot h_t(x, y)$



Fig. 1. The SMOTEBoost algorithm 
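The reweighting part of one boosting round (steps 4 to 6) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `m2_round` and the array layout are assumptions. It follows Freund and Schapire's AdaBoost.M2 definition, including the factor 1/2 in the pseudo-loss that keeps it in [0, 1]; that factor is an assumption here. The SMOTE step and the weak learner are outside the sketch.

```python
import numpy as np

def m2_round(D, h, y):
    """One AdaBoost.M2 reweighting step.

    D : (m, C) mislabel distribution, zero at the true label, sums to 1
    h : (m, C) weak hypothesis outputs h_t(x_i, y) in [0, 1]
    y : (m,)   true labels as integer class indices
    """
    m, C = h.shape
    rows = np.arange(m)
    h_true = h[rows, y][:, None]              # h_t(x_i, y_i) as a column
    mislabel = np.ones((m, C), dtype=bool)
    mislabel[rows, y] = False                 # B = {(i, y): y != y_i}

    # Pseudo-loss over the mislabel pairs (with the assumed factor 1/2).
    eps = 0.5 * np.sum(D[mislabel] * (1.0 - h_true + h)[mislabel])
    # beta_t and the per-pair exponents w_t.
    beta = eps / (1.0 - eps)
    w = 0.5 * (1.0 - h + h_true)
    # Reweight the mislabel pairs and renormalize (Z_t is the sum).
    D_next = np.where(mislabel, D * beta ** w, 0.0)
    return eps, beta, D_next / D_next.sum()
```

With beta below 1, pairs on which the hypothesis prefers the wrong label get a smaller exponent and therefore keep more of their weight, so the next round concentrates on them.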

We used RIPPER [17], a learning algorithm that builds a set of rules for identifying
the classes while minimizing the amount of error, as the classifier in our
SMOTEBoost experiments. RIPPER is a rule-learning algorithm based on the
separate-and-conquer strategy. We applied SMOTE with different values for the
parameter N that specifies the amount of synthetically generated examples.

4 Experiments

Our experiments were performed on the four data sets summarized in Table 2. For all
data sets except the KDDCup-99 intrusion detection data set [20, 21], the reported
(averaged) values for recall, precision and F-value were obtained by performing
10-fold cross-validation. For the KDDCup-99 data set, however, the separate
intrusion detection test set was used to evaluate the performance of the proposed
algorithms. Since the original training and test data sets have totally different distributions
due to novel intrusions introduced in the test data, for the purposes of this paper, we 
modified the data sets in order to have similar distributions in the training and test 
data. Therefore, we first merged the original training and test data sets and then sam- 
pled 69,980 network connections from this merged data set in order to reduce the size 
of the data set. The sampling was performed only from majority classes (normal 
background traffic and the DoS attack category), while other classes (Probe, U2R, 
R2L) remained intact. Finally, the new train and test data sets used in our experiments 
were obtained by randomly splitting the sampled data set into equal size subsets. The 
distribution of network connections in the new test data set is given in Table 2. 
Unlike the KDDCup-99 intrusion data set, which has a mixture of both nominal and
continuous features, the remaining data sets (mammography [1], satimage [22],
phoneme [23]) have only continuous features. For the satimage data set we chose the
smallest class as the minority class and collapsed the remaining classes into one class,
as was done in [24]. This procedure gave us a skewed 2-class data set, with 5809
majority class examples and 626 minority class examples.



Table 2. Summary of data sets used in experiments

Data set              Number of majority          Number of minority    Number of
                      class instances             class instances       classes
KDDCup-99 Intrusion   DoS: 13027, Probe: 2445,    U2R: 136, R2L: 1982   5
                      Normal: 17400
Mammography           10923                       260                   2
Satimage              5809                        626                   2
Phoneme               3818                        1586                  2



When experimenting with SMOTE and the SMOTEBoost algorithm, different values
for the SMOTE parameter N, ranging between 100 and 500, were used for the
minority classes. Since the KDDCup-99 data set has two minority classes, U2R and
R2L, that are not equally represented in the data set, different combinations of
SMOTE parameters were investigated for these two minority classes (values 100,
300, and 500 were used for the U2R class while the value 100 was used for the R2L
class). The values of the SMOTE parameter for the U2R class were higher than those
for the R2L class, since the U2R class is rarer than the R2L class in the KDDCup-99
data set (R2L has a larger number of examples). Our experimental results showed
that higher values of the SMOTE parameter for the R2L class could lead to
over-fitting and decreased prediction performance on that class (since SMOTEBoost
achieved only minor improvements for the R2L class, these results are not reported
here due to space limitations).

The experimental results for all four data sets are presented in Tables 3 to 6 and in
Figures 2 to 4. It is important to note that these tables report only the prediction
performance for the minority classes from the four data sets, since prediction of the
majority class was not of interest in this study. Moreover, precision captures the false
positives (FPs) introduced in the classification, so the F-value accounts for majority
class examples that are wrongly classified. Due to space limitations, the figures with
precision and recall trends over the boosting iterations, along with the F-value trends
for the representative SMOTE parameter, are not shown for the R2L class from the
KDDCup-99 data or for the satimage data set. In addition, the left and the right parts
of the reported figures do not have the same scale, because the range of changes in
recall and precision shown in the same graph is much larger than the change of the
F-value.
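As a quick check of how the three reported measures relate, the sketch below recomputes the F-value from recall and precision. It assumes the balanced F-value (the harmonic mean of the two, with equal weight); the helper name `f_value` is hypothetical. The two input pairs are the U2R values reported in Table 3 for the single-classifier and standard boosting baselines.

```python
def f_value(recall, precision):
    """Balanced F-value: harmonic mean of recall and precision (same units)."""
    if recall + precision == 0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)

# Recall and precision (in percent) for the U2R class, as reported in Table 3.
reported = {
    "Standard RIPPER": (57.35, 84.78),
    "Standard Boosting": (80.15, 90.083),
}
for method, (r, p) in reported.items():
    print(f"{method}: F = {f_value(r, p):.2f}")
# prints:
# Standard RIPPER: F = 68.42
# Standard Boosting: F = 84.83
```

Both results agree with the F-values listed in Table 3. For the cross-validated data sets the reported F-values are averages over folds, so they need not equal the F-value of the averaged recall and precision exactly.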



Table 3. Final values for recall, precision and F-value for minority U2R class when proposed
methods are applied on KDDCup-99 intrusion data set ($N_{u2r}$ corresponds to the SMOTE
parameter for U2R class, while $N_{r2l}$ corresponds to the SMOTE parameter for R2L class)

Method                   N_u2r   N_r2l   Cost factor   Recall   Precision   F-value
Standard RIPPER          -       -       -             57.35    84.78       68.42
Standard Boosting        -       -       -             80.15    90.083      84.83
SMOTE                    100     100     -             80.15    88.62       84.17
SMOTE                    300     100     -             74.26    92.66       82.58
SMOTE                    500     100     -             68.38    86.11       71.32
SMOTEBoost               100     100     -             84.2     93.9        88.8
SMOTEBoost               300     100     -             87.5     88.8        88.15
SMOTEBoost               500     100     -             84.6     92.0        88.1
First SMOTE then Boost   100     100     -             81.6     90.92       86.01
First SMOTE then Boost   300     100     -             82.5     89.30       85.77
First SMOTE then Boost   500     100     -             82.9     89.12       85.90
AdaCost                  -       -       c = 2         83.1     96.6        89.3
AdaCost                  -       -       c = 5         83.45    95.29       88.98



Table 4. Final values for recall, precision and F-value for minority class when proposed
methods are applied on mammography data set

Method                   Parameter   Recall   Precision   F-value
Standard RIPPER          -           48.12    74.68       58.11
Standard Boosting        -           59.09    77.05       66.89
SMOTE                    N = 100     58.04    64.96       61.31
SMOTE                    N = 200     62.16    60.53       60.45
SMOTE                    N = 300     62.55    56.57       58.41
SMOTE                    N = 500     64.51    53.81       58.68
SMOTEBoost               N = 100     61.73    76.59       68.36
SMOTEBoost               N = 200     62.63    74.54       68.07
SMOTEBoost               N = 300     64.16    69.92       66.92
SMOTEBoost               N = 500     61.37    70.41       65.58
First SMOTE then Boost   N = 100     60.22    76.16       67.25
First SMOTE then Boost   N = 200     62.61    72.10       [illegible]
First SMOTE then Boost   N = 300     63.92    70.26       66.94
First SMOTE then Boost   N = 500     64.14    69.80       66.85
AdaCost                  c = 2       59.83    69.07       63.01
AdaCost                  c = 5       68.45    55.12       59.36

Table 5. Final values for recall, precision and F-value for minority class when proposed
methods are applied on Satimage data set

Method                   Parameter   Recall   Precision   F-value
Standard RIPPER          -           47.43    67.92       55.50
Standard Boosting        -           58.74    80.12       67.78
SMOTE                    N = 100     65.17    55.88       59.97
SMOTE                    N = 200     74.89    48.08       58.26
SMOTE                    N = 300     76.32    47.17       57.72
SMOTE                    N = 500     77.96    44.51       56.54
SMOTEBoost               N = 100     63.88    77.71       70.12
SMOTEBoost               N = 200     65.35    73.17       69.04
SMOTEBoost               N = 300     67.87    72.68       70.19
SMOTEBoost               N = 500     67.73    69.5        68.6
First SMOTE then Boost   N = 100     64.69    72.53       68.65
First SMOTE then Boost   N = 200     69.23    67.10       68.15
First SMOTE then Boost   N = 300     67.25    69.92       68.56
First SMOTE then Boost   N = 500     67.84    68.02       67.93
AdaCost                  c = 2       64.85    54.58       58.2
AdaCost                  c = 5       60.85    56.01       57.6



Table 6. Final values for recall, precision and F-value for minority class when proposed
methods are applied on phoneme data set

Method                   Parameter   Recall   Precision     F-value
Standard RIPPER          -           62.28    69.13         65.15
Standard Boosting        -           76.1     77.07         76.55
SMOTE                    N = 100     82.18    59.91         68.89
SMOTE                    N = 200     85.88    58.51         69.59
SMOTE                    N = 300     89.79    56.15         69.04
SMOTE                    N = 500     94.2     50.22         65.49
SMOTEBoost               N = 100     81.86    73.66         [illegible]
SMOTEBoost               N = 200     84.86    [illegible]   76.47
SMOTEBoost               N = 300     86       66.76         75.16
SMOTEBoost               N = 500     88.46    65.16         75.04
First SMOTE then Boost   N = 100     82.05    72.34         76.89
First SMOTE then Boost   N = 200     85.25    68.97         76.25
First SMOTE then Boost   N = 300     87.37    66.38         75.44
First SMOTE then Boost   N = 500     89.21    64.73         75.03
AdaCost                  c = 2       76.83    75.71         75.99
AdaCost                  c = 5       85.05    68.71         75.9



[Figure 2 panel titles: U2R SMOTE parameter = 100, R2L SMOTE parameter = 100]

Fig. 2. Precision, Recall, and F-values for the minority U2R class when the SMOTEBoost
algorithm is applied on the KDDCup 1999 data set

Fig. 3. Precision, Recall, and F-values for the minority class when the SMOTEBoost algorithm
is applied on the Mammography data set




Fig. 4. Precision, Recall, and F-values for the minority class when the SMOTEBoost algorithm 
is applied on the Satimage data set 

Analyzing Figures 2 to 4 and Tables 3 to 6, it is apparent that SMOTEBoost
achieved higher F-values than the other presented methods, including standard
boosting, AdaCost, SMOTE with the RIPPER classifier and the standard RIPPER
classifier, although the improvement varied with different data sets. We have also
compared SMOTEBoost to the procedure “First SMOTE, then Boost”, in which we first
apply SMOTE and then perform boosting in two separate steps. It is SMOTEBoost’s
apparent improvement in recall, without a significant degradation in precision,
that improves the overall F-value. Tables 3 to 6 include the precision, recall,
and F-value for the various methods at different amounts of SMOTE (best values are
given in bold). These reported values indicate that SMOTE applied with the RIPPER
classifier has the effect of improving the recall of the minority class due to improved
coverage of the minority class examples, while at the same time SMOTE causes a
decrease in precision due to an increased number of false positive examples. Thus,
SMOTE is more targeted to the minority class than standard boosting or RIPPER. On
the other hand, standard boosting is able to improve both the recall and precision of a
single classifier, since it gives all errors equal weights. SMOTE embedded within the
boosting procedure additionally improved the recall achieved by the boosting
procedure, and did not cause a significant degradation in precision, thus increasing the
F-value. SMOTE as a part of SMOTEBoost allows the learners to broaden the minority
class scope, while boosting on the other hand aims at reducing the number of false
positives and false negatives.

Tables 3 to 6 show the precision, recall, and F-values achieved by varying the
amount of SMOTE for each of the minority classes for all four data sets used in our
experiments. We report the aggregated result of 25 boosting iterations in the tables.
The improvement was generally higher for the data sets where the skew among the
classes was also higher. Comparing SMOTEBoost and AdaBoost.M1, for the
KDDCup-99 data set, the (relative) improvement in F-value for the U2R class (~4%) was
drastically higher than for the R2L class (0.61%). The U2R class was significantly
less represented in the data set than the R2L class (the number of U2R examples was
around 15 times smaller than the number of examples from the R2L class). In
addition, the (relative) improvements in F-value for the mammography (2.2%) and
satimage (3.4%) data sets were better than for the phoneme data set (1.4%), which had
much less imbalanced classes. For the phoneme data, boosting and SMOTEBoost were
comparable to each other, while for higher values of the SMOTE parameter N,
boosting was even better than SMOTEBoost. In this data set the number of majority class
examples is only twice the number of minority class examples, and increasing the
SMOTE parameter N to values larger than 200 causes the minority class to become
the majority. Hence, the classifiers in the SMOTEBoost ensemble will now tend to
over-learn the minority class, causing a higher degradation in precision for the
minority class and therefore a reduction in F-value.

We have also shown that SMOTEBoost gives higher F-values than the AdaCost
algorithm [7]. The cost-adjustment functions from the AdaCost algorithm were
chosen as follows: $\beta_- = 0.5 \cdot c + 0.5$ and $\beta_+ = -0.5 \cdot c + 0.5$, where $\beta_-$ and $\beta_+$ are the
functions for mislabeled and correctly labeled examples, respectively. AdaCost causes a
greater sampling from the minority class examples due to the $\beta$ function in the
boosting distribution. This implicitly has an effect of over-sampling with replication.
SMOTEBoost, on the other hand, constructs new examples at each round of boosting,
thus avoiding overfitting and achieving higher minority class classification
performance than AdaCost. Although AdaCost improves the recall over AdaBoost, it
significantly reduces the precision, thus causing a reduction in F-value. It is also
interesting to note that SMOTEBoost achieves better F-values than the procedure “First
SMOTE, then Boost”, since in every boosting iteration new examples from the minority
class are generated, and thus more diverse classifiers are created in the boosting
ensemble. Finally, SMOTEBoost particularly focuses on the examples selected in the
distribution $D_t$, which are potentially misclassified or are on the classification boundaries.

5 Conclusions 

A novel approach for learning from imbalanced data sets is presented. The proposed
SMOTEBoost algorithm is based on the integration of the SMOTE algorithm within
the standard boosting procedure. Experimental results from several imbalanced data
sets indicate that the proposed SMOTEBoost algorithm can result in better prediction
of minority classes than AdaBoost, AdaCost, the “First SMOTE then Boost” procedure
and a single classifier. Data sets used in our experiments contained different degrees
of imbalance and different sizes, thus providing a diverse test bed.

The SMOTEBoost algorithm successfully utilizes the benefits from both boosting 
and the SMOTE algorithm. While boosting improves the predictive accuracy of clas- 
sifiers by focusing on difficult examples that belong to all the classes, the SMOTE 
algorithm improves the performance of a classifier only on the minority class exam- 
ples. Therefore, the embedded SMOTE algorithm forces the boosting algorithm to 
focus more on difficult examples that belong to the minority class than to the majority 
class. SMOTEBoost implicitly increases the weights of the misclassified minority 
class instances (false negatives) in the distribution by increasing the number of 
minority class instances using the SMOTE algorithm. Therefore, in the subsequent 
boosting iterations SMOTEBoost is able to create broader decision regions for the 
minority class compared to standard boosting. We conclude that SMOTEBoost
can construct an ensemble of diverse classifiers and reduce the bias of the classifiers. 
SMOTEBoost combines the power of SMOTE in vastly improving the recall with the 
power of boosting in improving the precision. The overall effect is a better F-value. 

Our experiments have also shown that SMOTEBoost is able to achieve higher
F-values than AdaCost, due to SMOTE's ability to improve the coverage of the minority
class when compared to the indirect effect of oversampling with replication in
AdaCost.

Although the experiments have provided evidence that the proposed method can 
be successful for learning from imbalanced data sets, future work is needed to address 
its possible drawbacks. First, automatic determination of the amount of SMOTE will
not only be useful when deploying SMOTE as an independent approach, but also for
combining SMOTE and boosting. Second, our future work will also focus on
investigating the effect of mislabeling noise on the performance of SMOTEBoost, since it is
known that boosting does not perform well in the presence of noise.

Abstract. We consider the classification of structured (e.g. XML) textual
documents. We first propose a generative model based on Belief
Networks which allows us to simultaneously take into account struc- 
ture and content information. We then show how this model can be 
extended into a more efficient classifier using the Fisher kernel method. 
In both cases model parameters are learned from a labelled training set 
of representative documents. We present experiments on two collections 
of structured documents: WebKB which has become a reference corpus 
for HTML page classification and the new INEX corpus which has been 
developed for the evaluation of XML information retrieval systems. 

Keywords: textual document classification, structured document, XML
corpus, Belief Networks, Fisher Kernel.



1 Introduction 

The development of large electronic document collections and Web resources has 
been paralleled by the emergence of structured document format proposals. They 
are aimed at encoding content information in a suitable form, for a variety of 
information needs. These document formats allow us to enrich the document con- 
tent with additional information (document logical structure, meta-data, com- 
ments, etc) and to store and access the documents in a more efficient way. Some 
proposals have already gained some popularity, and description languages like
XML are already widely used by different communities. For text documents, 
these representations encode both structural and content information. 

With the development of structured collections, there is a need to develop
information access methods which take full advantage of these richer
representations and also address new information access challenges and new
user needs. Current Information Retrieval (IR) methods have mainly been
developed for handling flat document representations and cannot be easily adapted
to deal with structured representations.

In this paper, we focus on the particular task of structured document
categorization. We propose methods for exploiting both the content and the structure
information for this task. Our core model is a generative categorization model
based on belief networks (BN). This work offers a natural framework for
encoding structured representations and allows us to perform inference both on
whole documents and on document subparts. We then show how to turn this
generative model into a discriminant classification model using the Fisher kernel
trick. The paper is organized as follows: Section 2 briefly reviews previous work
on structured document classification, Section 3 describes the type of structured
document we are working on, Section 4 introduces our generative model and
Section 5 the discriminant model. Section 6 presents a series of experiments performed
on two textual collections, the WebKB [20] and the INEX corpus [7].

2 Previous Works 

Text categorization is a classical information retrieval task which has motivated
a large amount of work over the last few years. Most categorization models
have been designed for handling bag-of-words representations and do not
consider word ordering or document structure. Generally speaking, classifiers fall
into two categories: generative models, which estimate class conditional densities
$P(document / Class)$, and discriminant models, which directly estimate the
posterior probabilities $P(Class / document)$. The naive Bayes model [12], for example, is
a popular generative categorization model, whereas among discriminative
techniques support vector machines [10] have been widely used over the last few
years. [17] gives a complete review of flat document categorization methods.
More recently, models which take into account sequence information have been
proposed [3]. Classifying structured documents is a new challenge from both the IR
and the machine learning perspectives. For the former, flat text classifiers do not lead
to natural extensions for structured documents; however, there has recently been
some interest in the classification of HTML pages. For the latter, the
classification of structured data is an open problem, since most classifiers have been
designed for vector or sequence representations, and only a few formal
frameworks allow content and structure information to be considered simultaneously. We
briefly review below recent work in these different areas.

The expansion of the Web has motivated a series of works on Web page
categorization, viz. the last two TREC competitions [19]. Web pages are built
from different types of information (title, links, text, etc.) which play different
roles. There have been several attempts to combine these information sources in
order to increase page categorization scores ([5], [21]). Chakrabarti ([2]) proposes
to use the information contained in the neighboring documents of an HTML page.
All these approaches, which deal only with HTML, propose simple schemes either
for encoding the page structure or for exploiting the different types of information
by combining basic classifiers. These models exploit a priori knowledge about the
particular semantics of HTML tags, and as such cannot be extended to more
complex languages like XML where tags may be defined by the user. We will see
that our model does not exploit this type of semantics and is able to learn from
data the importance of tag information.

Some authors have proposed more principled approaches to deal with the
general problem of structured document categorization. These models are not
specific to HTML, even when they are tested on HTML databases due to the lack
of a reference XML corpus. [4], for example, propose the Hidden Tree Markov
Model (HTMM), which is an extension of HMMs to a structured
representation. They consider tree-structured documents where, in each node (structural
element), terms are generated by a node-specific HMM. [16] have proposed a
Bayesian network for classifying structured documents. This is a discriminative
model which directly computes the posterior probability corresponding to the
document relevance for each class. [22] present an extension of the Naive Bayes
model to semi-structured documents where essentially global word frequency
estimators are replaced with local estimators computed for each path element.
[18] propose to use Probabilistic Relational Models to classify structured
documents and more precisely Web pages.

For the ad-hoc IR task, Bayesian networks (BN) have been used for
information retrieval for some time. The Inquery [1] retrieval engine operates on flat text,
while more recent proposals handle structured documents, e.g. [14], [15].
Outside the field of information retrieval, some models have been proposed to
handle structured data. The hierarchical HMM (HHMM) [6] is a generalization of
HMMs to structured data; it has been tested on handwriting recognition and on
the analysis of English sentences. Similar HMM extensions have been used for
multi-agent modeling [13]. However, inference and learning algorithms in these
models are too computationally demanding for handling large IR tasks. The
inference complexity for HHMM is $O(NT^3)$, where N is the number of states in
the HMM and T the length of the text in words; for comparison, our model is
more like $O(N + T)$, as will be seen later.

The core model we propose is a generative model which has been developed
for the categorization of any tree-like document structure (typically XML
documents). This model bears some similarities with the one in [4]; however, their
model is adapted to the semantics of HTML documents and considers only the
inclusion relation between two document parts. Ours is generic and can be used
for any type of structured document. Even when tags do not convey semantic
information, it allows considering different types of relations between structured
elements: inclusion, depth in the hierarchical document, etc. This model could
be considered as a special case of the HHMM [6], since it is simpler and since
HHMMs can be represented as particular BNs [13]. It is computationally much
less demanding and has been designed for handling large document collections.
This generative model is then extended into a discriminant one using the method
of the Fisher kernel. For that, we extend to the case of structured data the ideas
initially proposed by [9] for sequences.

Our main contributions are a new generative model for the categorization of
large collections of structured documents and its extension via the use of Fisher
kernels into a discriminant model. We also describe, for the first time to our
knowledge, experiments on a large corpus of structured XML documents (INEX)
developed in 2002 for ad-hoc retrieval tasks.

3 Document Structure

We represent a structured document d as a Directed Acyclic Graph (DAG). 
Each node of the graph represents a structural entity of the document, and 
each edge represents a hierarchical relation between two entities (for example, 
a paragraph is included in a section, two paragraphs are on the same level of 
the hierarchy, etc.). To keep inference complexity at a reasonable level, we do
not consider circular relations which might appear in some documents (e.g. Web
sites). This restriction is not too severe, since this definition already encompasses
many different types of structured documents.

Each node of the DAG is composed of a label (for example, labels can be
section, paragraph, title and represent the structural semantics of a document)
and textual information (which is the textual content associated with this
node, if any).

A structured document then contains three types of information: the logical 
structure information represented by the edges of the DAG (the position of 
the tag in an XML document), the label information (the name of the tag 
in an XML document) and the textual information. Figure 1 gives a simple
example of a structured document.




[Figure 1: nodes of the example document, as far as recoverable]

Label: DOCUMENT
Text:

Label: SECTION
Text: "The second section is composed of one single paragraph"

Label: PARAGRAPH
Text: "This is the third paragraph"

Fig. 1. An example of a structured document represented as a Directed Acyclic Graph.
This document is composed of an introduction and two sections. Each part of the
document is represented by a node with a label and textual information.



4 A Generative Model for Structured Documents 

We now present a generative model for structured documents. It is based on BNs
and handles the three types of information described above. This model can be
used with any XML document, without a priori information about the semantics
of the structure. We first briefly introduce BNs and then describe the different
elements of the model.

4.1 Notations

We will use the following notations:

- $d$ is a structured document.

- $s_d$ is the structure of document $d$: $s_d = (\{s_d^i\}, \{pa(s_d^i)\})$, where
$\{s_d^i\}_{i \in [1..|s_d|]}$ is the set of node labels ($|s_d|$ is the number of
structural nodes of $d$) and $s_d^i \in \Lambda$, with $\Lambda$ the set of possible labels.
$pa(s_d^i)$ denotes the parents of node $s_d^i$ in the structured document and
describes the logical structure information.

- $t_d$ is the textual information in $d$: $t_d$ is a set $\{t_d^i\}_{i \in [1..|s_d|]}$ of textual
elements, one for each node $i$ of the structured document.

- $\{w_{d,k}^i\}_{k \in [1..|t_d^i|]}$ is the set of words of part $t_d^i$ in document $d$ ($|t_d^i|$ is the
number of words of part $t_d^i$ and $w_{d,k}^i$ is the $k$th word of $t_d^i$), with
$w_{d,k}^i \in V$, where $V$ is the set of indexing terms in the corpus.

4.2 Base Model Using Belief Networks 

Belief networks [11] are stochastic models for computing the joint probability
distribution over a set of random variables. A BN is a DAG whose nodes are
the random variables and whose edges correspond to probabilistic dependence
relations between two such variables. More precisely, the DAG reflects conditional
independence properties between variables, and the joint probability of a set of
variables is written:

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid pa(x_i))$$

where $pa(x_i)$ denotes the parents of $x_i$.
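As an illustration of this factorization, the following minimal sketch evaluates the joint probability of a three-variable network. The network (variables `a`, `b`, `c`), its conditional probability tables and the function name `joint_probability` are hypothetical and chosen only for the example; they do not come from the paper.

```python
from itertools import product
from math import prod

# Hypothetical network: a -> c <- b (all variables binary).
parents = {"a": (), "b": (), "c": ("a", "b")}

# cpt[var][parent_values][value] = P(var = value | parents = parent_values)
# The numbers are made up for illustration.
cpt = {
    "a": {(): {True: 0.2, False: 0.8}},
    "b": {(): {True: 0.1, False: 0.9}},
    "c": {
        (True, True): {True: 0.99, False: 0.01},
        (True, False): {True: 0.9, False: 0.1},
        (False, True): {True: 0.8, False: 0.2},
        (False, False): {True: 0.05, False: 0.95},
    },
}

def joint_probability(assignment):
    """P(x_1, ..., x_n) as the product of P(x_i | pa(x_i))."""
    return prod(
        cpt[var][tuple(assignment[p] for p in parents[var])][assignment[var]]
        for var in parents
    )

# P(a=True, b=False, c=True) = P(a) * P(b) * P(c | a, b)
p = joint_probability({"a": True, "b": False, "c": True})

# Sanity check: the joint distribution sums to one over all assignments.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
```

Each factor only needs the values of a variable's parents, which is what keeps the number of parameters small compared with a full joint table.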

4.3 Generative Model Components 

We consider a structured document as the realization of random variables
$T$ and $S$ corresponding respectively to textual and structural information. For
simplicity, we will denote $P(T = t_d, S = s_d \mid \theta)$ as $P(t_d, s_d \mid \theta)$. Let $\theta$ denote the
parameters of our document model; the probability of a document is written:

$$P(d \mid \theta) = P((t_d, s_d) \mid \theta) = P(s_d \mid \theta) \, P(t_d \mid s_d, \theta) \qquad (1)$$

$P(s_d \mid \theta)$ is the structural probability of $d$ and $P(t_d \mid s_d, \theta)$ is the textual
probability of $d$ given its structure $s_d$. Each document will be modeled via a
BN whose nodes correspond either to tag or textual information and whose
directed edges encode the relations between the document elements. The whole
corpus will then be represented as a series of BN models, one per document. The BN
model of a document can be thought of as a model of the structured document
generation, where the generation process goes as follows: someone who wants
to create a document about a specific topic will sequentially and recursively
create the document organization and then fill the corresponding nodes with
text. For example, he first creates sections, after which, for each section, he creates
subsections, and so on recursively. At the end, in each "terminal" node, he will create
the textual information of this part as a succession of words. This is a typical
generative approach which extends to structured information the classical HMM
approach for modeling sequences. The two components of the model, structure
and content, are detailed below.



Structural Probability 

The structural information of a document is encoded into the edges of the BN.
Under the conditional independence assumption of the BN document model, the
structural density of document $d$ is written:

$$P(s_d \mid \theta) = \prod_{i=1}^{|s_d|} P(s_d^i \mid pa(s_d^i)) \qquad (2)$$

The BN structural parameters are then the $\{P(s_d^i \mid pa(s_d^i))\}$, which are the
probabilities of observing $s_d^i$ given its parents $pa(s_d^i)$ in the BN.

In order to have a robust estimation of the BN parameters, we will share sets
of parameters among all the document models. We will make the hypothesis that
the $\{P(s_d^i \mid pa(s_d^i))\}$ only depend on the labels of nodes $s_d^i$ and $pa(s_d^i)$, i.e. two
nodes in two different document models which share the same label and whose
parents also share the same labels will have the same conditional probability
$P(s_d^i \mid pa(s_d^i))$.

Within this framework, several BN models may be associated to a document 
d. Figure 2 illustrates two of the models we have been working with. The DAG 
structure of Model 2 is copied from the tree structure of the document and 
reflects only the inclusion relation. The same type of relation is used in [4]. Model 
1 contains both inclusion information (vertical edges) and sequence information 
(horizontal edges). Both models are an overly simplified representation of the 
real dependencies between document parts. This keeps the complexity
of the learning and inference algorithms low. Statistical models that work best are
often very simple compared to the underlying phenomenon (e.g. naive Bayes in
text classification or Hidden Markov Models in speech recognition), and practitioners
of BNs have experienced the same phenomenon. Note that other instances of
our generic model could also have been used here.



Textual Probability 

For modeling the textual content of a structured document, we make the following
hypotheses:

- the probability of a word only depends on the label of the node that contains
this word (first order dependency assumption);

- in a node, words are independent (Naive Bayes assumption).

The naive Bayes hypothesis is not mandatory here, and any other term generative
model (e.g. HMM) could be used instead. However, this hypothesis allows for
a robust density estimation, and in our experiments more sophisticated models
did not lead to any performance improvement.

[Figure 2: two panels, MODEL 1 and MODEL 2.]

Fig. 2. Two possible structural belief networks constructed for the document presented
in Figure 1.



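For a particular part $t_d^i$ of document $d$, we then have:

$$P(t_d^i \mid s_d^i, \theta) = \prod_{k=1}^{|t_d^i|} P(w_{d,k}^i \mid s_d^i, \theta) \qquad (3)$$

And for the entire document, we have:

$$P(t_d \mid s_d, \theta) = \prod_{i=1}^{|s_d|} \prod_{k=1}^{|t_d^i|} P(w_{d,k}^i \mid s_d^i, \theta) \qquad (4)$$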



Final Belief Network 

Combining Equations 2 and 4, we get a generative structured document model:

$$P(d \mid \theta) = \prod_{i=1}^{|s_d|} \left( P(s_d^i \mid pa(s_d^i), \theta) \prod_{k=1}^{|t_d^i|} P(w_{d,k}^i \mid s_d^i, \theta) \right) \qquad (5)$$

Equation 5 describes the contribution of structural and textual information
in our model.

4.4 Learning

This model is completely defined by two sets of parameters, transition and emission
probabilities, respectively denoted by $P(s_i \mid s_j)$ and $P(w_i \mid s_j)$.

In order to learn $\theta$, we use the Expectation Maximization (EM) algorithm
to maximize the likelihood of the data. Since evidence is available for
any variable in the BN model of a document, this simply amounts to a count for
each possible value of the random variables.



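Using Equation 5, the log-likelihood for all documents in the corpus $D$ is:

$$L = \sum_{d \in D} \left( \sum_{i=1}^{|s_d|} \log P(s_d^i \mid pa(s_d^i), \theta) + \sum_{i=1}^{|s_d|} \sum_{k=1}^{|t_d^i|} \log P(w_{d,k}^i \mid s_d^i, \theta) \right) \qquad (6)$$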



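Let us denote the model parameters $P(s_d^i \mid pa(s_d^i))$ and $P(w_{d,k}^i \mid s_d^i)$ by $\theta_{s_d^i, pa(s_d^i)}$
and $\theta_{w_{d,k}^i, s_d^i}$. Equation 6 is then written:

$$L = \sum_{d \in D} \left( \sum_{i=1}^{|s_d|} \log \theta_{s_d^i, pa(s_d^i)} + \sum_{i=1}^{|s_d|} \sum_{k=1}^{|t_d^i|} \log \theta_{w_{d,k}^i, s_d^i} \right) \qquad (7)$$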



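The learning algorithm then solves $\partial L / \partial \theta_{n,m} = 0$, where $L$ is the log-likelihood of
Equation 7, under the constraint $\sum_{n} \theta_{n,m} = 1$.

Let $N^d_{n,m}$ be the number of times a part with label $n$ has a parent with
label $m$ in document $d$ or, respectively, the number of times a word with value $n$
occurs in a part with label $m$ in document $d$. The solution of the learning problem
is:

$$\theta_{n,m} = \frac{\sum_{d \in D} N^d_{n,m}}{\sum_{i} \sum_{d \in D} N^d_{i,m}} \qquad (8)$$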



The complexity of the algorithm is $O\left(\sum_{d \in D} (|s_d| + |t_d|)\right)$. In a classical structured
document, the number of nodes of the structural network is smaller than the
number of words of the document, so the complexity is equivalent to $O\left(\sum_{d \in D} |t_d|\right)$,
which is the classical learning complexity of the Naive Bayes algorithm. Note
that, in the case of flat documents, our model is strictly equivalent to the classical
Naive Bayes model.
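The counting procedure described in this section can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that every node has a single parent (a tree-shaped network, as in Model 2), it uses a hypothetical parent label `"ROOT"` for the root node, and it applies no smoothing, so unseen (word, label) pairs would raise a `KeyError`. The toy documents are made up.

```python
from collections import Counter, defaultdict
from math import log

# A document is a list of nodes: (label, parent_label, words).
# Assumption: one parent per node; the root uses the hypothetical label "ROOT".
doc1 = [("DOCUMENT", "ROOT", []),
        ("SECTION", "DOCUMENT", ["belief", "network"]),
        ("PARAGRAPH", "SECTION", ["fisher", "kernel", "kernel"])]
doc2 = [("DOCUMENT", "ROOT", []),
        ("PARAGRAPH", "DOCUMENT", ["kernel", "network"])]

def estimate(corpus):
    """Relative-frequency estimates theta[(n, m)], in the spirit of Equation 8."""
    trans, emit = Counter(), Counter()
    for doc in corpus:
        for label, parent, words in doc:
            trans[(label, parent)] += 1      # label n under a parent with label m
            for w in words:
                emit[(w, label)] += 1        # word n inside a part with label m

    def normalise(counts):
        totals = defaultdict(int)
        for (n, m), c in counts.items():
            totals[m] += c                   # sum over the first index for each m
        return {(n, m): c / totals[m] for (n, m), c in counts.items()}

    return normalise(trans), normalise(emit)

def log_likelihood(doc, trans, emit):
    """log P(d | theta): structural plus textual terms (Equation 5 in log form)."""
    total = 0.0
    for label, parent, words in doc:
        total += log(trans[(label, parent)])
        total += sum(log(emit[(w, label)]) for w in words)
    return total

trans, emit = estimate([doc1, doc2])
ll = log_likelihood(doc2, trans, emit)
```

A single pass over the nodes and words of the corpus suffices, which matches the linear complexity stated above.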



5 Discriminant Approach 

The above model could be used for different tasks, e.g. document classification 
or clustering or even for performing more sophisticated inferences on document 
parts, e.g. deciding which part of a document is relevant for a specific topic. For 
classification of whole documents, which is the focus of this paper, discriminant
approaches are most often preferred to generative ones since they usually score
higher. We therefore propose below to derive a discriminant model from the
generative model of structured documents. For that, we follow the line of [9], who proposed
to build a discriminant model from a generative sequence model. We show how
this idea can be extended to our generative structured document model.

5.1 Fisher Kernel

Given a generative model with parameters $\theta$ for sequences, [9] propose to compute
for each sequence $x$ the Fisher score $U_x = \nabla_\theta \log P(x \mid \theta)$ of the model for the
sequence, i.e. the gradient of the log-likelihood of $x$ for model $\theta$. For each sequence
sample, this score is a vector of fixed dimensionality which explains how each
parameter of the generative model contributes to generating the sequence.

Using this score, they then define a kernel function between two examples $x$ and $y$:

$$K(x, y) = U_x^T M^{-1} U_y \quad \text{with} \quad M = E_X[U_X U_X^T] \qquad (9)$$

This kernel can then be used with any kernel classifier (e.g. SVM) in order to
classify the examples. The key idea here is to map the sequence information onto
a vector of scores. This makes it possible to apply any classical vector discriminant
classifier to this new representation, and therefore to use well-known and efficient
vector classifiers for sequence classification. We show below that this idea may
be naturally adapted to our structured generative model.



5.2 Fisher Kernel for the Structured Document Model 



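For our model, the Fisher Kernel can be easily computed. Using Equation 7 we get:

$$\frac{\partial \log P(d \mid \theta)}{\partial \theta_{n,m}} = \frac{N^d_{n,m}}{\theta_{n,m}} \qquad (10)$$

where $N^d_{n,m}$ is the count for document $d$ introduced for Equation 8.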



The Fisher kernel idea, initially proposed for HMMs, naturally carries over
to our structured data model. However, in practice, using the Fisher Kernel
method is not straightforward. In order to make the method work, one must
make different approximations, especially when the number of parameters of the
generative model is high, which is the case here. In our implementation, we make
the following approximations:

- we first approximate the matrix $M$ by the identity matrix, as in [9];

- we then compute the gradient of the log-likelihood with respect to $2\sqrt{\theta_{n,m}}$, as in [8].

Let

$$\rho_{n,m} = 2\sqrt{\theta_{n,m}}$$

we have:

$$\frac{\partial \log P(d \mid \theta)}{\partial \rho_{n,m}} = \frac{2 N^d_{n,m}}{\rho_{n,m}} = \frac{N^d_{n,m}}{\sqrt{\theta_{n,m}}}$$

We use this last formula to compute the vector corresponding to each structured
document $d$.
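Under the two approximations above, the vector of a document and the resulting kernel can be sketched as below. The function names (`fisher_vector`, `fisher_kernel`), the fixed parameter ordering `index` and the parameter values are hypothetical and serve only as an illustration; the entries are the counts $N^d_{n,m}$ divided by $\sqrt{\theta_{n,m}}$, and the kernel is a plain dot product because $M$ is replaced by the identity.

```python
from collections import Counter
from math import sqrt

def fisher_vector(counts, theta, index):
    """Score vector with entries N^d_{n,m} / sqrt(theta_{n,m}).

    counts: Counter of (n, m) occurrences in one document
    theta:  dict of estimated parameters theta[(n, m)]
    index:  fixed ordering of the (n, m) pairs, shared by all documents
    """
    return [counts.get(key, 0) / sqrt(theta[key]) for key in index]

def fisher_kernel(u, v):
    """K(x, y) = U_x^T U_y, i.e. M approximated by the identity matrix."""
    return sum(a * b for a, b in zip(u, v))

# Made-up parameters for two (word, label) pairs.
theta = {("kernel", "PARAGRAPH"): 0.64, ("network", "PARAGRAPH"): 0.36}
index = sorted(theta)

u_x = fisher_vector(Counter({("kernel", "PARAGRAPH"): 2}), theta, index)
u_y = fisher_vector(
    Counter({("kernel", "PARAGRAPH"): 1, ("network", "PARAGRAPH"): 3}), theta, index
)
k = fisher_kernel(u_x, u_y)
```

Because the vectors have fixed dimensionality, they can also be passed directly to a linear vector classifier instead of going through an explicit kernel matrix.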



6 Experiments 

6.1 Corpora 

We use two corpora in our experiments. 

The WebKB corpus [20] is composed of 8282 HTML documents from the web sites of
computer science departments. This is a reference corpus in the machine learning
community for classifying HTML pages. It is composed of 7 topics (classes):
student, faculty, course, project, department, staff, other. Other is a trash topic
and has been ignored here, as is usually done. We are then left with 4520
documents. We used Porter stemming and pruned all words that appear in fewer
than 5 documents. The size of the vocabulary V is 8038 terms. We only keep the
tags with the highest frequency (H1, H2, H3, TITLE, B, I, A). We performed a 5-fold
cross-validation (80% for training and 20% for testing).

The INEX corpus [7] is the new reference corpus for Information Retrieval with
XML documents. It was designed for ad-hoc retrieval. It is made up of articles
from journals and proceedings of the IEEE Computer Society. All articles are
XML documents. The collection contains approximately 15 000 articles from over
20 different journals or proceedings. We used Porter stemming and pruned the
words which appear in fewer than 50 documents. The final size of the vocabulary
is about 50 000 terms and the number of tags is about 100. We made a random
split using 50% for training and 50% for testing. The task was to classify articles
into the right journal or proceedings (20 classes).



6.2 Results 

We have used a Naive Bayes classifier as a baseline generative classifier and
an SVM [10] as a baseline discriminant model. Results appear in Figures 3 and 4.
The macro-average is obtained by averaging the percentage of correct classification
over all classes considered. The micro-average is obtained by weighting this average
by the relative size of each class.
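The difference between the two averages can be seen on a small made-up example. The function name `macro_micro` and the class counts below are hypothetical and are not taken from the experiments.

```python
def macro_micro(correct, total):
    """correct[c] / total[c]: correctly classified / all test documents of class c."""
    per_class = {c: correct[c] / total[c] for c in total}
    # Macro-average: every class counts equally.
    macro = sum(per_class.values()) / len(per_class)
    # Micro-average: each class is weighted by its relative size.
    n = sum(total.values())
    micro = sum(per_class[c] * total[c] / n for c in total)
    return macro, micro

# A large class classified well and a small class classified poorly.
macro, micro = macro_micro({"course": 90, "staff": 5}, {"course": 100, "staff": 10})
```

With unbalanced classes the micro-average is dominated by the large classes, whereas the macro-average is pulled down by poorly classified small classes.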

Let us consider the micro-average. On WebKB, the BN model achieves a
mean 3% improvement with regard to Naive Bayes. This is encouraging and
superior to already published results on this dataset [4]. The Fisher model
raises this score by a further 4%. This corresponds to 2% more than the baseline
discriminant SVM. The structured generative document model is clearly superior to the
flat Naive Bayes classifier, and the Fisher Kernel operating on the structured
generative models compares well to the baseline SVM.

Fig. 3. Performance of 5 classifiers on WebKB corpus.

Fig. 4. Performance of 5 classifiers on INEX corpus.

On the much larger INEX database, our generative model achieves about 2% 
micro-average improvement with regard to Naive Bayes and the Fisher Kernel 
method increases the BN score by about 6%, but only 1% compared to the 
baseline SVM. This confirms the good results obtained on WebKB. Note that, 
to our knowledge, these are the first classification results obtained on a real world 
large XML corpus. 

These experiments show that it is important to take structure and content
information into account simultaneously in HTML or XML documents. The
proposed methods make it possible to model and combine the two types of information.
Both the generative and discriminant models for structured documents offer a
complexity similar to that of the baseline flat classification models while increasing
performance.

7 Conclusion and Perspectives 

We have proposed a new generative model for the representation of structured
textual documents. This model offers a general framework for handling different
tasks like classification, clustering or more specialized structured document access
problems. We focused here on the classification of whole documents and
described how to extend this generative model into a more efficient discriminant
classifier using the Fisher kernel idea. Experiments performed on two databases
show that the proposed methods are indeed able to take the structure and content
information into account simultaneously and offer good performance
compared to baseline classifiers.

Abstract. Various different algorithms for learning Bayesian networks from data
have been proposed to date. In this paper, we adopt a novel approach that com- 
bines the main advantages of these algorithms yet avoids their difficulties. In our 
approach, first an undirected graph, termed the skeleton, is constructed from the 
data, using zero- and first-order dependence tests. Then, a search algorithm is 
employed that builds upon a quality measure to find the best network from the 
search space that is defined by the skeleton. To corroborate the feasibility of our 
approach, we present the experimental results that we obtained on various differ- 
ent datasets generated from real-world networks. Within the experimental setting, 
we further study the reduction of the search space that is achieved by the skeleton. 



1 Introduction 

The framework of Bayesian networks has proven to be a useful tool for capturing and 
reasoning with uncertainty. A Bayesian network consists of a graphical structure,
encoding a domain’s variables and the probabilistic relationships between them, and a
numerical part, encoding probabilities over these variables (Cowell et al., 1999). Building
the graphical structure of a Bayesian network and assessing the required probabilities 
by hand is quite labour-intensive. With the advance of information technology, however, 
more and more datasets are becoming available that can be exploited for constructing 
a network automatically. Learning a Bayesian network from data then amounts to find- 
ing a graphical structure that, supplemented with maximum-likelihood estimates for its 
probabilities, most accurately describes the observed probability distribution. 

Most state-of-the-art algorithms for learning Bayesian networks from data take one 
of two approaches: the use of (in)dependence tests (Rebane and Pearl, 1987; Spirtes 
and Glymour, 1991; de Campos and Huete, 2000) and the use of a quality measure 
(Cooper and Herskovits, 1992; Buntine, 1991; Heckerman et al., 1995; Lam and
Bacchus, 1993). Although with both approaches encouraging results have been reported,
they both suffer from some difficulties. With the first approach, a statistical test such as
$\chi^2$ is employed for examining whether or not two variables are dependent given some
conditioning set of variables; the order of the test is the size of the conditioning set
used. By starting with zero-order tests and selectively growing the conditioning set, in
theory all (in)dependence statements can be recovered from the data and the network
that generated the data can be reconstructed. In practice, however, the statistical test
employed quickly becomes unreliable for higher orders, because the amount of data
required increases exponentially with the order. If the test would then return incorrect
(in)dependence statements, errors could arise in the graphical structure. With the second
approach, a quality measure such as MDL is used for assessing the quality of candidate
graphs. The graphical structure yielding the highest score then is taken to be the one
that best explains the observed data. This approach suffers from the size of the search 
space. To efficiently traverse the huge space of graphical structures, often a greedy 
search algorithm is used. Other algorithms explicitly constrain the space by assuming a 
topological ordering on the nodes of candidate structures. Both types of algorithm may 
inadvertently prune high-quality networks from the search space of structures. 

In this paper, we adopt a novel approach to learning Bayesian networks from data 
that combines the main advantages of the two approaches outlined above. In our ap- 
proach, first an undirected graph is constructed from the data using just zero- and first- 
order dependence tests. The resulting graph, termed the skeleton of the network under 
construction, is used to explicitly restrict the search space of graphical structures. In the 
second phase of our approach, a search algorithm is employed to traverse the restricted 
space. This algorithm orients or removes each edge from the skeleton to produce a
directed graphical structure. To arrive at a fully specified Bayesian network, this structure
is supplemented with maximum-likelihood estimates computed from the data. We
experimented with two instances of our approach, building upon a simple hill-climber
and upon a genetic algorithm, respectively, for the search algorithm. The results that we 
obtained compare favourably against various state-of-the-art learning algorithms. 

The paper is organised as follows. In Section 2, we present the details of our
approach. In doing so, we focus on the construction of the skeleton; an in-depth
discussion of the design of a competent search algorithm for the second phase of our approach
is presented elsewhere (van Dijk et al., 2003). The experimental results obtained with
our approach are reported in Section 3. We analyse various properties of the skeleton in 
Section 4. We end the paper with a discussion of our approach in Section 5. 



2 Skeleton-Based Learning 

Our approach divides the task of learning a Bayesian network from data into two phases.
In the first phase, a skeleton is constructed. This skeleton is taken as a template that
describes all graphical structures that can be obtained by orienting or deleting its edges.
In the second phase, the search space that is defined by the skeleton is traversed by
means of a search algorithm. Focusing on the first phase, we discuss the construction of
the skeleton in Section 2.1; in Section 2.2, we briefly review related work.

2.1 Constructing the Skeleton 

We consider learning a Bayesian network from a given dataset. For ease of exposition, 
we assume that this dataset has been generated by sampling from a network whose 
graphical structure perfectly captures the dependences and independences of the
represented distribution. The undirected graph underlying the structure of this network will
be referred to as the true skeleton. In the first phase of our approach, we construct a
skeleton from the available data to restrict the search space for the second phase. In
doing so, we aim to find a skeleton that is already close to the true skeleton. On the
one hand, we try to avoid missing edges, because these could prune the best network
from the search space. On the other hand, we try to minimise the number of additional
edges, since these would unnecessarily increase the size of the space to be traversed. To
construct an appropriate skeleton, we analyse the dependences and independences that
are embedded in the dataset for the various different variables. To this end, a statistical
test is employed. Well-known examples of such tests are the $\chi^2$ statistic and the mutual
information criterion. In the sequel, we will write $DT(X, Y \mid \mathbf{Z})$ if, for a given threshold
value, the test indicates that the variables $X$ and $Y$ are dependent given the (possibly
empty) conditioning set of variables $\mathbf{Z}$; otherwise, we write $\neg DT(X, Y \mid \mathbf{Z})$.

Steven van Dijk, Linda C. van der Gaag, and Dirk Thierens

Fig. 1. (a) Two non-neighbouring variables $Y_1$ and $Y_2$ that are dependent yet become independent
given $X$. (b) The separation graph $G(X)$ of $X$, with $L(X) = \{Z, Y_1, Y_2\}$. The variable $Y_2$ is identified
as a neighbour of $X$ since it has no incoming arcs. The arc from $Y_2$ to $Z$ indicates $Z$ to be a
non-neighbour. Removal of the arc from $Z$ to $Y_1$ reveals $Y_1$ to be a neighbour of $X$.

When constructing the skeleton, we try to identify the true neighbours of each
variable $X$. To this end, we begin by identifying all variables that have a zero-order
dependence on $X$. If for a specific variable $Y$, the test employed fails to report a result, for
example due to a lack of data, we assume that $Y$ is independent of $X$. We now observe
that, while neighbouring variables in the true skeleton are always dependent, the reverse
does not hold; two dependent variables may be separated by one or more intervening
variables, for example as in Figure 1(a). The list $L(X) = \{Y \mid DT(X, Y)\}$ obtained
therefore includes neighbours as well as non-neighbours of $X$ from the true skeleton. Since
a non-neighbour $Y$ of $X$ is separated from $X$ in the true skeleton by a set of true
neighbours of $X$, we expect that $\neg DT(X, Y \mid \mathbf{Z})$ for some set $\mathbf{Z} \subseteq L(X) \setminus \{Y\}$. We now use
first-order tests to remove, from among the list $L(X)$, any non-neighbours of $X$. The
skeleton then in essence is found by adding an edge between $X$ and $Y \in L(X)$ if and
only if $DT(X, Y \mid \{Z\})$ holds for all $Z \in L(X) \setminus \{Y\}$.

We note that using just first-order tests as outlined above does not suffice for
identifying all non-neighbours of $X$ from among the list $L(X)$. In fact, a higher-order
test may be required to establish a variable $Y$ as a non-neighbour of $X$; for example,
if $DT(X, Y \mid \{Z_1\})$, $DT(X, Y \mid \{Z_2\})$, and $\neg DT(X, Y \mid \{Z_1, Z_2\})$, a second-order
test is needed for this purpose. By using higher-order tests, therefore, additional
non-neighbours could be identified and a sparser skeleton could result. As we have argued
before, however, the test employed quickly becomes unreliable for larger conditioning
sets, thereby possibly giving rise to errors in the skeleton. Since the purpose of the
skeleton is to safely reduce the search space, we restrict the tests employed to just zero-
and first-order tests, and let the search algorithm remove the spurious edges.

We further note that the method described above, when applied straightforwardly,
could erroneously remove some true neighbours of a variable $X$ from the list $L(X)$.
As an example we consider the neighbours $Y_1$ and $Y_2$ of $X$ in the true skeleton, where
$Y_2$ has $Z$ for a second neighbour. Now, if the true neighbour $Y_1$ would exhibit a weak
dependence on $X$ and the non-neighbour $Z$ would exert a very strong influence on $X$,
then values for $Z$ could hide the dependence of $Y_1$ on $X$. The dependence test then
returns $\neg DT(X, Y_1 \mid \{Z\})$ and $Y_1$ would be identified as a non-neighbour of $X$. To support
identification of the true neighbours of $X$, therefore, we construct an auxiliary directed
graph termed the separation graph $G(X)$ of $X$. The variables from the list $L(X)$ are the
nodes of the graph. There is an arc from a variable $Z$ to a variable $Y$ if $\neg DT(X, Y \mid \{Z\})$
for some $Z \in L(X) \setminus \{Y\}$. If the first-order test fails to establish dependence or
independence, we assume dependence. In the separation graph $G(X)$, all variables without
any incoming arcs are true neighbours of $X$, since these variables remain dependent on
$X$ regardless of the conditioning set used. The thus identified neighbours are used to
find non-neighbours of $X$, by following their outgoing arcs. The outgoing arcs of these
non-neighbours are removed, which may cause other variables to reveal themselves as
neighbours. Figure 1(b) illustrates the basic idea of the separation graph. The process
of identifying neighbours and non-neighbours is repeated until no more neighbours of
$X$ can be identified. Variables that are part of a cycle in the remaining separation graph
are all marked as neighbours of $X$: these variables correspond to ambiguous situations,
which are thus resolved safely, that is, without discarding possible neighbours. We note
that the process of identifying neighbours requires at most $|L(X)|$ iterations. The
skeleton is now built by finding the neighbours of every variable and connecting these.
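The neighbour-identification procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: `dt` is a hypothetical callable standing in for the dependence test (returning `True` for dependence, `False` for independence and `None` when the test fails), and every variable that is still undecided when the loop stalls is kept as a neighbour, which covers the cycle case conservatively.

```python
def find_neighbours(x, variables, dt):
    """Neighbours of x from zero- and first-order tests via the separation graph."""
    # Zero-order list L(X); a failed test (None) is read as independence.
    candidates = [y for y in variables if y != x and dt(x, y, ()) is True]
    # Arc z -> y if conditioning on z makes y independent of x;
    # a failed first-order test is read as dependence, so no arc is added.
    arcs = {(z, y) for y in candidates for z in candidates
            if z != y and dt(x, y, (z,)) is False}
    neighbours, non_neighbours = set(), set()
    while True:
        undecided = [y for y in candidates
                     if y not in neighbours and y not in non_neighbours]
        # Variables without incoming arcs are neighbours.
        new = {y for y in undecided if not any(t == y for _, t in arcs)}
        if not new:
            break
        neighbours |= new
        # Follow the outgoing arcs of the new neighbours to find non-neighbours.
        non_neighbours |= {t for s, t in arcs if s in new and t not in neighbours}
        # Remove the outgoing arcs of all non-neighbours.
        arcs = {(s, t) for s, t in arcs if s not in non_neighbours}
    # Assumption: whatever is still undecided is kept as a neighbour.
    leftover = set(candidates) - neighbours - non_neighbours
    return neighbours | leftover

# Toy test mimicking Figure 1(b): Z hides Y1, and Y2 separates Z from X.
independent = {("Y1", ("Z",)), ("Z", ("Y2",))}

def dt(x, y, cond):
    return (y, cond) not in independent

result = find_neighbours("X", ["X", "Y1", "Y2", "Z"], dt)
```

On this toy input, Y2 is found first, Z is then discarded through the arc from Y2, and removing the arc from Z reveals Y1, so the loop needs two productive iterations, within the $|L(X)|$ bound mentioned above.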

2.2 Related Work 

de Campos and Huete (2000) use a skeleton within a test-only approach. The skeleton 
is built by connecting variables for which no zero- or first-order test indicates inde- 
pendence. From the thus constructed skeleton, a directed graphical structure is derived 
without employing a search algorithm. The authors do suggest the use of such an algo- 
rithm, however. Cheng et al. (2002) also present a test-only approach that builds upon 
a skeleton constructed from lower-order dependence tests. Steck and Tresp (1999) deal 
with the construction of a usable skeleton when unreliable tests offer conflicting depen- 
dence statements. The Hybrid Evolutionary Programming (HEP) algorithm by Wong 
et al. (2002) takes an approach that is closely related to ours. Although the algorithm 
does not explicitly construct a skeleton, it does use zero- and first-order dependence 
tests to restrict the search space of graphical structures that is subsequently traversed 
by an MDL-based search algorithm. The HEP algorithm has shown high-quality per- 
formance on datasets generated from the well-known Alarm network. 

3 Experiments 

Our approach to learning Bayesian networks allows for various different instances. For
the first phase, different dependence tests can be employed and for the second phase,
different quality measures and different search algorithms can be used. We present two
such instances in Section 3.1 and report on their performance in Section 3.2.

3.1 The Instances

To arrive at an instance of our approach, we have to specify a dependence test to be
used in the construction of a skeleton as outlined in the previous section. In our
experiments, we build upon the $\chi^2$ test, using independence for its null-hypothesis. The
test calculates a statistic from the contingency table of the variables $X$ and $Y$ concerned:

$$\chi^2(X, Y) = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij} = p(x_i, y_j) \cdot N$ is the observed frequency of the combination of values $(x_i, y_j)$
and $E_{ij} = p(x_i) \cdot p(y_j) \cdot N$ is the expected frequency of $(x_i, y_j)$ if $X$ and $Y$ were
independent; $N$ denotes the size of the available dataset and $p(x_i)$ denotes the proportion of $x_i$.
The computed statistic is compared against a critical value $s$ determined by a
given threshold $\varepsilon$ and the $\chi^2_{df}$ distribution with $df$ degrees of freedom. If the statistic is
higher than $s$, the null-hypothesis is rejected, that is, $X$ and $Y$ are established as being
dependent. The test for dependence of $X$ and $Y$ given $Z$ is defined analogously, taking
independence of $X$ and $Y$ for every possible value of $Z$ for the null-hypothesis. In our
experiments, we use the threshold values $\varepsilon_0 = 0.005$ for the zero-order dependence test
and $\varepsilon_1 = 0.05$ for the first-order dependence test. Choosing the $\chi^2$ test allows for a
direct comparison of our approach against the HEP algorithm mentioned above.
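As a small worked example of the zero-order test, the sketch below computes the statistic from a contingency table of observed counts. The function names and the table are hypothetical; the critical value 7.879 is the standard tabulated value for one degree of freedom at the 0.005 level and is passed in by hand rather than computed.

```python
def chi_square(table):
    """Chi-square statistic of a contingency table of observed counts O_ij."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # E_ij = p(x_i) p(y_j) N
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def dependent(table, critical_value):
    """Reject independence when the statistic exceeds the critical value."""
    return chi_square(table) > critical_value

# Made-up 2x2 table: rows are values of X, columns are values of Y.
table = [[30, 10], [10, 30]]
stat = chi_square(table)          # every expected count is 20 here
is_dependent = dependent(table, 7.879)
```

For a table with r rows and c columns the zero-order test has (r - 1)(c - 1) degrees of freedom, which determines the critical value to use.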

With the specification of a dependence test and its associated threshold values, the
first phase of our approach has been detailed. To arrive at a fully specified instance, we
now have to detail the search algorithm to be used for traversing the space of graphical 
structures and the quality measure it employs for comparing candidate structures. In our 
experiments, we use the well-known MDL quality measure (Lam and Bacchus, 1993). 
This measure originates from information theory and computes the description length 
of a Bayesian network and a given dataset; the description length equals the sum of 
the size of the network and the size of the dataset after it has been compressed given 
the network. While a more complex network can better describe the data and hence 
compress it to a smaller size than a simpler network, it requires a larger encoding to 
specify its arcs and associated probabilities. The best network for a given dataset now 
is the network that best balances its complexity and its ability to describe the data. 

In our experiments, we further use two different search algorithms: a simple
hillclimber and a genetic algorithm. The hillclimber sets out with the empty graph. In each
step, it considers all pairs of neighbouring nodes from the skeleton and all possible
changes to the graph under construction, that is, remove, insert, or reverse the
considered arc; it then selects the change that improves the MDL score the most. This process
is repeated until the score cannot be further improved. The genetic algorithm builds
upon an encoding of graphical structures by strings of genes. Each gene corresponds
with an edge in the skeleton and can be set in one of three states, matching absence
and either orientation of the edge. A special-purpose recombination operator is used to
guarantee good mixing and preservation of building blocks (van Dijk et al., 2003).
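The hillclimber described above can be sketched as a greedy search over the three states of every skeleton edge. This is an illustrative sketch only: `score` is a hypothetical callable standing in for the MDL measure and is assumed to be a description length (lower is better), the acyclicity check is an assumption added because the result must be a DAG, and the toy score at the end simply counts differences from a made-up target structure.

```python
def has_cycle(arcs):
    """True if the directed graph given by a set of (u, v) arcs contains a cycle."""
    children = {}
    for u, v in arcs:
        children.setdefault(u, []).append(v)
    state = {}  # node -> 1 while on the DFS stack, 2 when finished

    def visit(u):
        state[u] = 1
        for v in children.get(u, ()):
            if state.get(v) == 1 or (v not in state and visit(v)):
                return True
        state[u] = 2
        return False

    return any(u not in state and visit(u) for u in list(children))

def hill_climb(skeleton, score):
    """Greedy search over orientations and deletions of skeleton edges.

    skeleton: list of (u, v) pairs; score(arcs): value to minimise.
    """
    arcs = frozenset()                       # start from the empty graph
    best = score(arcs)
    while True:
        candidates = []
        for u, v in skeleton:
            rest = arcs - {(u, v), (v, u)}
            # absent, u -> v, or v -> u: covers remove, insert and reverse
            for option in (rest, rest | {(u, v)}, rest | {(v, u)}):
                if option != arcs and not has_cycle(option):
                    candidates.append((score(option), sorted(option)))
        if not candidates or min(candidates)[0] >= best:
            return set(arcs), best           # no change improves the score
        best, chosen = min(candidates)
        arcs = frozenset(chosen)

# Toy stand-in for the quality measure: distance to a made-up target structure.
target = {("A", "B"), ("B", "C")}
found, final_score = hill_climb(
    [("A", "B"), ("B", "C"), ("A", "C")],
    lambda arcs: len(set(arcs) ^ target),
)
```

The three options per edge correspond to the three gene states used by the genetic algorithm, so the same encoding of candidate structures could serve both search algorithms.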
Table 1. Results of the experiments. The top part shows in the first column the MDL score of
the original network, averaged over five datasets. The second column shows the results from the
GA with the true skeleton, averaged over five datasets and five runs per dataset. The third column
gives the results from the hillclimber on the true skeleton, averaged over five datasets. The bottom
part lists the results obtained with the GA, with the HEP algorithm, and with the hillclimber.

               Original (score ± sd)    GA+true (score ± sd)    HC+true (score)
Alarm-250        7462.14 ± 197.20         5495.05 ± 0.02           5645.56
Alarm-500       10862.49 ± 196.59         9272.49 ± 0              9519.90
Alarm-2000      31036.54 ± 236.00        30214.72 ± 0             31677.65
Alarm-10000    138914.66 ± 764.09       138774.32 ± 0            145564.45
Oesoca-250      10409.23 ± 88.20          7439.62 ± 0              7535.68
Oesoca-500      15692.60 ± 154.25        13034.87 ± 0.19          13274.75
Oesoca-2000     46834.31 ± 244.07        45115.91 ± 2.77          45542.13
Oesoca-10000   213253.55 ± 387.12       212364.09 ± 1.23         213204.67

               GA (score ± sd)          HEP (score ± sd)         HC (score)
Alarm-250        5566.71 ± 0.20           5523.50 ± 11.74          5654.52
Alarm-500        9458.03 ± 1.62           9260.87 ± 35.97          9619.94
Alarm-2000      30563.03 ± 2.29          30397.20 ± 161.86        31727.86
Alarm-10000    138955.54 ± 72.36        139499.51 ± 530.41       144677.31
Oesoca-250       7703.43 ± 0.82           9619.25 ± 0.10           7743.33
Oesoca-500      13282.38 ± 0.07          15388.89 ± 16.75         13474.23
Oesoca-2000     45335.16 ± 2.12          46504.41 ± 40.18         45861.42
Oesoca-10000   212544.21 ± 27.47        212446.04 ± 276.97       214942.04



3.2 Experimental Results 

We studied the two instances of our approach outlined above and compared their per- 
formance against that of the HEP algorithm. We used datasets that were generated 
by means of logic sampling from two real-world Bayesian networks. The well-known 
Alarm network was built to help anesthetists monitor their patients and is quite com- 
monly used for evaluating the performance of algorithms for learning Bayesian net- 
works. The Oesoca network was developed at Utrecht University, in close collaboration 
with experts from the Netherlands Cancer Institute; it was built to aid gastroenterolo- 
gists in assessing the stage of oesophageal cancer and in predicting treatment outcome. 
Table 1 summarises the results obtained for datasets of four different sizes for each 
network. The results for the genetic algorithm and for the HEP algorithm are averaged 
over five different datasets and five runs of the algorithm per dataset; the results for 
the hillclimber are averaged over the five datasets. Depending on the size of the data 
set, running times ranged from two to 80 minutes for the GA, up to 30 minutes for the 
hillclimber, and up to seven minutes for the HEP algorithm. Calculation of the skeleton 
could take up to 50% of the total time of a run with the GA. 

The bottom part of Table 1 shows that all three algorithms under study perform 
quite well. The table in fact reveals that the algorithms often yield a network that has 
a lower score than the original network, whose score is shown in the top part of the 
table. The fact that the original network may not be the one of highest quality can be 







Steven van Dijk, Linda C. van der Gaag, and Dirk Thierens 



attributed to the datasets being finite samples. Since the datasets are subject to sampling 
error, they may not accurately reflect all the (in)dependences from the original network. 
The distribution observed in the data may then differ from the distribution captured 
by the original network. From the bottom part of the table we further observe that 
the genetic algorithm and the HEP algorithm perform comparably. The small standard 
deviation revealed by the genetic algorithm indicates that it is likely to always give 
results of similar quality. Since the HEP algorithm reveals much more variation, the 
genetic algorithm may be considered the more reliable of the two algorithms. 

The top part of Table 1 summarises the results obtained with the GA and with the 
hillclimber when given the true skeleton rather than the skeleton constructed from the 
data. We note that only a slight improvement in quality results from using the true 
skeleton. From this observation, we may conclude that, for practical purposes, the constructed 
skeleton is of high quality. The good performance of the hillclimber, moreover, 
is an indication of how much the learning task benefits from the use of the skeleton. 



4 Analysis of Our Approach 

We recall from Section 2 that our approach divides the task of learning a Bayesian 
network from data into two phases. In the first phase, a skeleton is constructed that 
is taken as a specification of part of the search space of graphical structures. In the 
second phase, the specified subspace is traversed by a search algorithm. The feasibility 
of our approach depends to a large extent on the properties of the computed skeleton. 
First of all, to avoid pruning optimal solutions from the search space, there should 
be no edges of the true skeleton missing from the computed skeleton. Secondly, there 
should be few additional edges: the more densely connected the computed skeleton is, 
the less feasible it is to traverse the specified subspace of graphical structures. Since 
it is very hard to prove theoretical results about the computed skeleton, we opt for an 
experimental investigation of its properties. In the subsequent sections, we compare the 
computed skeleton against the true skeleton in increasingly realistic situations. 
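
The two quantities used throughout this comparison, edges missing from the computed skeleton and additional edges not present in the true skeleton, can be computed as set differences. The sketch below is illustrative only; the function name `compare_skeletons` and the representation of undirected edges as unordered pairs are assumptions.

```python
def compare_skeletons(true_edges, computed_edges):
    """Return (missing, additional) undirected edges of a computed skeleton."""
    # frozenset makes the edge (X, Y) equal to the edge (Y, X).
    true_set = {frozenset(edge) for edge in true_edges}
    computed_set = {frozenset(edge) for edge in computed_edges}
    missing = true_set - computed_set        # true edges that were pruned
    additional = computed_set - true_set     # extra edges that enlarge the search space
    return missing, additional


missing, additional = compare_skeletons(
    [("A", "B"), ("B", "C")],
    [("B", "A"), ("A", "C")],
)
```

A non-empty `missing` set means that optimal structures may have been pruned; a large `additional` set means that the restricted search space is still expensive to traverse.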

4.1 Use of a Perfect Oracle 

To investigate by how much a computed skeleton can deviate from the true skeleton, 
we performed an experiment in which we precluded the effects of sampling error and 
of inaccuracy from dependence tests. To this end, we constructed an oracle that reads 
the (in)dependences tested for from the structure of the original network. For the Alarm 
and Oesoca networks, we thus computed two skeletons each. For the first skeleton, we 
used zero-order dependence tests only: we connected each variable X to all variables 
having an unconditional dependence on X. For the construction of the second skeleton, 
we used zero- and first-order tests as outlined in Section 2. Table 2 reports the numbers 
of additional edges found in the computed skeletons compared against the true skeleton; 
the table further includes the results for a skeleton consisting of the complete graph. 

Since we used a perfect oracle to establish dependence or independence, the com- 
puted skeletons include all edges from the true skeletons: there are no edges missing. 
Table 2 therefore gives insight in the reduction of the search space for the second phase 

Table 2. Numbers of additional edges found in the skeletons constructed using an oracle. Results 
are shown for the complete skeleton, for the skeleton built by using just zero-order tests, and 
for the skeleton computed by the proposed approach. The true skeleton of the Alarm network 
includes 46 edges; the true skeleton of the Oesoca network includes 59 edges. 

                        Alarm    Oesoca
Complete skeleton         620       802
Zero-order skeleton       255       728
Computed skeleton          58       116



of our approach, that is achieved under the assumption that the strengths of the depen- 
dences do not affect the results from the dependence test. The table reveals that, under 
this assumption, sizable reductions are found. Note that further reductions could have 
been achieved by using higher-order dependence tests. Such tests, however, would have 
increased the computational demands for the construction of the skeleton. Moreover, in 
practice such tests would have quickly become highly unreliable. 
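
To illustrate how a zero-order skeleton can be read from the structure of a network, the sketch below assumes that the oracle answers dependence queries by d-separation. With an empty conditioning set, two variables are d-connected exactly when they share an ancestor, where every variable counts as its own ancestor. The names `ancestors` and `zero_order_skeleton` and the parent-list representation of the network are hypothetical.

```python
from itertools import combinations


def ancestors(node, parents):
    """Return the set containing node and all of its ancestors."""
    seen, stack = {node}, [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def zero_order_skeleton(parents):
    """Connect every pair of variables that is unconditionally dependent.

    parents maps each node to the list of its parents in the original DAG.
    """
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    closure = {n: ancestors(n, parents) for n in nodes}
    return {
        frozenset((x, y))
        for x, y in combinations(sorted(nodes), 2)
        if closure[x] & closure[y]          # common ancestor: d-connected
    }


# Toy network: A -> C <- B, C -> D. A and B are marginally independent.
skeleton = zero_order_skeleton({"C": ["A", "B"], "D": ["C"]})
```

In this toy network the zero-order skeleton contains the three true edges plus the additional edges A-D and B-D, which is the kind of surplus that the first-order tests are meant to remove.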

4.2 Use of the $\chi^2$ Test on a Perfect Dataset 

In the experiments described in Section 3, we used the $\chi^2$ test for studying dependence. 
We recall that, for two variables $X$ and $Y$, the $\chi^2$ test calculates a statistic $stat(X,Y)$. 
This statistic is compared against a critical value $s$ to decide upon acceptance or rejection 
of the null-hypothesis of independence of $X$ and $Y$. The critical value depends upon 
the threshold $\varepsilon$ and upon the degrees of freedom $df$ of the $\chi^2$ distribution used. Writing 
$f(\varepsilon, df)$ for the function that yields the critical value $s$, we have that the test reports 
dependence if $stat(X,Y) > f(\varepsilon, df)$. Hence, for a dataset of size $N$, dependence is reported if

$$N \cdot c > f(\varepsilon, df),$$

where $c$ is a constant that depends upon the marginal and joint probability distributions 
over $X$ and $Y$. The choice of the threshold $\varepsilon$ now directly influences the topology of the 
computed skeleton. If the threshold is set too low with respect to the size of the available 
dataset, weak dependences will escape identification and the skeleton will have edges 
missing. On the other hand, if the threshold is too high, coincidental correlations in the 
data will be mistaken for dependences and the skeleton will include spurious edges. 
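
The relation between the statistic and the dataset size can be made concrete with a small sketch. It assumes Pearson's form of the $\chi^2$ statistic, for which $stat(X,Y) = N \cdot c$ with $c = \sum_{x,y} (p(x,y) - p(x)p(y))^2 / (p(x)p(y))$, and it assumes strictly positive marginals. The function names are hypothetical, and the critical value is passed in directly rather than derived from a threshold.

```python
def dependence_constant(joint):
    """Compute c from a table of joint proportions p(x, y) that sum to one."""
    row = [sum(r) for r in joint]              # marginal p(x)
    col = [sum(c) for c in zip(*joint)]        # marginal p(y)
    return sum(
        (joint[i][j] - row[i] * col[j]) ** 2 / (row[i] * col[j])
        for i in range(len(row))
        for j in range(len(col))
    )


def reports_dependence(joint, n, critical_value):
    """The test reports dependence when N * c exceeds the critical value."""
    return n * dependence_constant(joint) > critical_value


# A weak dependence between two binary variables.
weak = [[0.27, 0.23], [0.23, 0.27]]
c = dependence_constant(weak)
found_small = reports_dependence(weak, 250, 3.841)
found_large = reports_dependence(weak, 1000, 3.841)
```

The value 3.841 is the standard $\chi^2$ critical value for one degree of freedom at the 0.05 level and is used here only as an illustration. With it, the weak dependence in the example goes undetected for 250 cases and is detected for 1000 cases, since the statistic grows linearly in the dataset size while the critical value stays fixed.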

To study the impact of the thresholds used with the $\chi^2$ test, we performed some experiments 
from which we precluded the effects of sampling error. For this purpose, 
we constructed virtual datasets that perfectly capture the probability distribution to 
be recovered: for the proportions $\hat{p}(C)$ reflected in these datasets, we thus have that 
$\hat{p}(C) = p(C)$, where $p(C)$ is the true distribution over the variables $C$. For the Alarm 
and Oesoca networks, we constructed various skeletons from virtual datasets of differ- 
ent sizes, using different thresholds. In our first experiment, we focused on skeletons 
that were constructed from zero-order dependence tests only. Figures 2(a) and 2(b) 
show the numbers of edges from the true skeletons of the Alarm network and of the Oe- 
soca network, respectively, that are missing from these zero-order skeletons. Figure 2(a) 
reveals that the Alarm network consists of quite strong dependences that are effectively 

Fig. 2. (a) Results for the Alarm network, (b) Results for the Oesoca network. Shown are the 
numbers of edges missing from the skeleton built from just zero-order dependences. 

Fig. 3. (a) Results for the Alarm network, (b) Results for the Oesoca network. Shown are the 
numbers of edges missing from the skeleton built using zero- and first-order dependence tests. 



found, even with small thresholds and small datasets. Figure 2(b) shows that a similar 
observation does not apply to the Oesoca network. This network models a very weak 
dependence that is only found with a threshold equal to zero, even for a dataset of size 
10000. The other relatively weak dependences modelled by the network are recovered 
by a tradeoff between the threshold used and the size of the dataset under study. 

We further constructed skeletons using zero- and first-order tests, as described in 
Section 2. Once again, we used virtual datasets of different sizes and employed dif- 
ferent thresholds for the first-order dependence test. For the zero-order test, we used, 
for the Alarm network, the highest threshold with which all edges from the true skele- 
ton were recovered; for the Oesoca network, we used the highest threshold with which 
all dependences except the two weakest ones were found. Figures 3(a) and 3(b) show 
the numbers of edges from the true skeletons of the Alarm network and of the Oe- 
soca network, respectively, that are missing from the thus computed skeletons. Figure 
3(a) shows that the true skeleton of the Alarm network is effectively recovered, with 
almost all thresholds and dataset sizes. Figure 3(b) shows, once again, that the weaker 
dependences modelled by the Oesoca network are only found by a tradeoff between the 
threshold and the size of the dataset used. 

Whereas Figure 3 shows the numbers of edges missing from the skeletons constructed 
using zero- and first-order tests, Figure 4 shows the numbers of edges in these skeletons 
that are absent from the true skeletons. Figures 4(a) and 4(b) thus show the numbers 

Fig. 4. (a) Results for the Alarm network, (b) Results for the Oesoca network. Shown are the 
numbers of spurious edges in the skeleton built using zero- and first-order dependence tests. 



of spurious edges in the skeletons constructed for the Alarm and Oesoca networks, re- 
spectively. Figure 4(b) especially illustrates the tradeoff between recovering (almost) 
all edges of the true skeleton and excluding spurious ones. We further observe that the 
landscapes of Figures 3(b) and 4(b) are not monotonically increasing or decreasing: 
Figure 3(b) reveals a valley and Figure 4(b) shows a ridge. These non-monotonicities 
are caused by a very weak dependency in the Oesoca network. With higher thresholds, 
the weakness of the dependency forestalls its identification, thereby hiding a true neighbour. 
Upon lowering the threshold, however, the neighbour is identified and thereby 
effectively changes the separation graph, which causes the observed ridge and valley. 

Since we used virtual datasets, we precluded from our experiments the effects of 
sampling error. Figures 3 and 4, therefore, give insight into the ability of our approach 
to construct a skeleton that is already close to the true skeleton, under the assumption 
that the dataset used perfectly captures the probability distribution to be recovered. The 
figures reveal that, under this assumption, most dependences are recovered with small 
threshold values, giving rise to good skeletons. We found, however, that carefully hand- 
crafted, real-world networks may embed very weak dependences that would require 
high thresholds for their recovery from data. 

4.3 Use of the $\chi^2$ Test with Sampled Datasets 

The last, and most realistic situation that we address, involves datasets that were gen- 
erated by means of logic sampling from a network under study. We recall that we used 
such datasets in our main experiments described in Section 3. We observe that sampled 
datasets differ from virtual datasets in two important aspects. Firstly, generated datasets 
show the effects of sampling errors, that is, the distribution observed in the dataset may 
differ slightly from the original distribution. Secondly, as generated datasets are finite, 
the dependence test used can fail to reliably establish dependence or independence. In 
our experiments, we adopted the common rule of thumb that the result of the $\chi^2$ test can 
be considered reliable only if all cells in the contingency tables have expected frequencies 
larger than five. As before, we compared the skeletons computed from the sampled 
datasets against the true skeletons of the Alarm and Oesoca networks. The numbers of 
missing and additional edges are summarised only briefly due to space restrictions. 
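
The rule of thumb on expected frequencies can be checked directly on a contingency table of observed counts. The following sketch is illustrative; the function names and the `minimum` parameter are assumptions, and the expected frequency of a cell is taken to be the product of its row and column totals divided by the table total.

```python
def expected_counts(table):
    """Expected cell frequencies under independence for a table of observed counts."""
    total = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi2_reliable(table, minimum=5):
    """True if every expected frequency is larger than the minimum."""
    return all(e > minimum for row in expected_counts(table) for e in row)


balanced = chi2_reliable([[30, 20], [20, 30]])   # every expected count is 25
sparse = chi2_reliable([[3, 2], [40, 55]])       # smallest expected count is 2.15
```

For first-order tests the same check would be applied to one table per value of the conditioning variable, which is why small datasets quickly fail the rule.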

The differences between the true skeletons and the skeletons constructed from the 
sampled datasets depended strongly upon the sizes of the datasets used and upon the 
thresholds employed. For low thresholds, the differences found were relatively modest. 
For example, with datasets of size 10000 sampled from the Oesoca network and 
with thresholds equal to 0.05, the average number of edges missing from the computed 
skeleton was 3; the average number of additional edges was 81. With virtual datasets 
of the same size, these numbers were 2 and 59, respectively. With smaller datasets, 
these differences became larger. For example, with datasets of size 250, using the same 
thresholds, the average number of edges missing from the computed skeletons was 23.6; 
the average number of additional edges was 86.2. With virtual datasets of the same 
size, we found these numbers to be 11 and 22, respectively. The thresholds used were 
found to have a much stronger impact on the numbers of missing and additional edges. 
With datasets of size 2000, for example, raising the thresholds from $\varepsilon_0 = \varepsilon_1 = 0.05$ to 
$\varepsilon_0 = 0.1$, $\varepsilon_1 = 0.4$ served to double the size of the resulting skeleton. 

From the above observations, we conclude that in realistic situations the thresholds 
used with the dependence test should be set to a relatively low value, since using more 
liberal thresholds would result in an unfavourable tradeoff between the size of the re- 
sulting skeleton and its number of missing edges. We further conclude that sampling 
error can cause substantial deviations of the computed skeleton from the true one. 



5 Discussion 

Most state-of-the-art algorithms for learning Bayesian networks from data build upon 
either the use of (in)dependence tests or the use of a quality measure and search algo- 
rithm. While important progress has been made with both approaches, we have argued 
that there are some obstacles to their practicability. Within the first approach, for ex- 
ample, the statistical test employed quickly becomes unreliable for larger conditioning 
sets. The second approach suffers from the huge space of graphical structures to be 
traversed. We have proposed a novel approach that combines the main advantages of 
these earlier algorithms yet avoids their difficulties. In the first phase of our approach, 
we use zero- and first-order dependence tests to build an undirected skeleton for the net- 
work under construction. This skeleton is used to explicitly restrict the search space of 
directed graphical structures to promising regions. Then, a search algorithm is used to 
traverse the restricted space to find a high-quality network. Our approach is general in 
the sense that it can be used with various different dependence tests, quality measures, 
and search algorithms. We have demonstrated the feasibility of our approach by means 
of experiments with two specific instances. These instances have shown good performance 
on datasets of various sizes generated from two real-world Bayesian networks. 

The good performance of even a simple hillclimber within our approach suggests 
that the restriction of the search space of graphical structures by means of a skeleton 
is safe, in the sense that it is not likely to prune high-quality networks. To corroborate 
this observation, we have compared the computed skeletons against the true skeleton 
in varying situations. Using virtual datasets for the well-known Alarm network, skele- 
tons without any missing edges and with up to twenty extra edges have been found, 
using very small thresholds for the dependence tests. Except for its three weakest dependences, 
all edges from the true skeleton of the Oesoca network have also been recovered 
with small thresholds; the numbers of extra edges for the skeletons computed 
from virtual datasets of various sizes range between 20 and 70. We conclude that the 
use of a skeleton provides for a careful balance of accuracy and a tractable search space. 

To conclude, we would like to note that the Oesoca network includes a dependence 
that is weak in general yet becomes very strong for patients in whom a relatively rare 
condition is found. This dependence is important from the point of view of the appli- 
cation domain, although for the learning task it is indistinguishable from the numerous 
irrelevant dependences found in the data. Since there is always a tradeoff between considering 
weak dependences that may be important and the computational resources one 
is willing to spend, we feel that learning a Bayesian network from data should always 
be performed in close consultation with a domain expert. 

Abstract. Naive Bayesian classifiers assume the conditional independence 
of attribute values given the class. Although this assumption is often 
violated in practice, these simple classifiers have been found efficient, 
effective, and robust to noise. 

Discretization of continuous attributes in naive Bayesian classifiers has 
received a lot of attention recently. Continuous attributes need not necessarily 
be discretized, but discretization unifies their handling with nominal attributes 
and can lead to improved classifier performance. 

We show that optimal partitioning results from decision tree learning 
carry over to Naive Bayes as well. In particular, Naive Bayes sets decision 
boundaries on borders of segments with equal class frequency distribution. An 
optimal univariate discretization with respect to the Naive Bayes rule 
can be found in linear time but, unfortunately, optimal multivariate 
optimization is intractable. 



1 Introduction 

The naive Bayesian classifier, or Naive Bayes, is surprisingly effective in classification 
tasks. Therefore, even if it does not belong to the state-of-the-art methods, it 
plays an important role, alongside decision tree learning, as a standard baseline 
method for inductive algorithms. Naive Bayesian classifiers have been studied 
extensively over the years [18,19,7]. 

In Naive Bayes numerical attributes can be handled without explicit dis- 
cretization of the value range [7,15] unlike in, e.g., decision tree induction. An 
often made assumption is that within each class the data is generated by a 
single Gaussian distribution. To model actual distributions more faithfully one 
can abandon the normality assumption and, rather, use nonparametric density 
estimation [7,15]. 

Treating numerical attributes by density estimation, Gaussian or other, indi- 
cates that numerical and discrete attributes are handled differently. Furthermore, 
discretization has been observed to increase the prediction accuracy and make 
the method more efficient [2,6]. There are discretization methods that are specific 
to Naive Bayes [2,6,25,4,26] as well as general approaches that are often used 
with naive Bayesian classifiers [13]. A particularly interesting fact is that Naive 
Bayes permits overlapping discretization [16,27] unlike many other classification 
learning algorithms. 

In the decision tree setting, the line of research stemming from Fayyad and Irani's 
[12] seminal work on optimal discretizations for evaluation functions of ID3 has 
led to more efficient preprocessing approaches and a better understanding of 
the necessary and sufficient preprocessing needed to guarantee finding optimal 
partitions with respect to common evaluation functions [8,9,10,11]. In this paper 
we show that this type of analysis carries over to naive Bayesian classifiers despite 
the difference between the univariate inspection in decision trees and the multivariate one in 
naive Bayesian classifiers. The decision boundaries separating decision regions 
(the points where the class prediction changes) of naive Bayesian classifiers fall exactly on the 
so-called segment borders. No other cut point candidates need to be considered in 
order to find the error-minimizing discretization. 

We show that with respect to one numerical attribute, a partition that opti- 
mizes the naive Bayes rule can be found in linear time using the same algorithm 
as in connection with decision trees. However, simultaneously satisfying the op- 
timality with respect to more than one attribute, unfortunately, has recently 
turned out to be NP-complete [23]. This leaves little hope of solving the 
problem efficiently. 

In Sect. 2 we first recapitulate the basics on naive Bayesian classification. 
In Sect. 3 the optima-preserving preprocessing of numerical value ranges is 
reviewed. In Sect. 4 we prove that the same line of analysis applies to naive 
Bayesian classifiers as well. We also briefly consider multivariate discretization 
in this section. Finally, Sect. 5 concludes this article by summarizing the work 
and discussing further research possibilities. 



2 Naive Bayes 



Naive Bayes gives an instance $x = (a_1, \ldots, a_n)$ the label 

$$\arg\max_{c \in C} \Pr(c \mid x), \qquad (1)$$

where $C$ is the set of classes. In other words, the classifier assigns for the given 
instance the class that is most probable. The computation of the conditional 
probability $\Pr(c \mid x)$ is based on the Bayes rule 

$$\Pr(c \mid x) = \frac{\Pr(c)\,\Pr(x \mid c)}{\Pr(x)}$$

and the (naive) assumption that the attributes $A_1, \ldots, A_n$ are independent of 
each other given the class, which indicates that 

$$\Pr(x \mid c) = \prod_{i=1}^{n} \Pr(A_i = a_i \mid c).$$

The denominator $\Pr(x)$ of the Bayes rule is the same for all classes in $C$. 
Therefore, it is convenient to consider the quantity 

$$\arg\max_{c \in C} \Pr(c \cap x) = \arg\max_{c \in C} \Pr(c \mid x)\,\Pr(x)$$

instead of (1). The two formulas, of course, always predict the same label. 

Probability estimation is based on a training set of classified examples 
$E = \{(x_i, y_i)\}_{i=1}^{m}$, where $y_i \in C$ for all $i$. Let $m_c$ denote the number of instances 
from class $c$ in $E$. Then, the data prior for class $c$ is $P(c) = m_c/m$. Discrete 
attributes are easy to handle: we just estimate $\Pr(x \mid c)$ based on the training set 
by estimating the conditional marginals $P(A_i = a_i \mid c)$ by counting the fraction 
of occurrences of each value $A_i = a_i$ among the $m_c$ instances of class $c$. 
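
For discrete attributes, the estimation by counting and the resulting classification rule can be sketched as follows. This is a minimal illustration, not code from the paper: the function names `train` and `predict` and the toy data are assumptions, and the estimates are plain data priors without any smoothing.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count classes and attribute values; examples is a list of (x, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(Counter)     # (attribute index, label) -> value counts
    for x, label in examples:
        for i, value in enumerate(x):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(x, class_counts, value_counts):
    """Return the class c that maximises P(c) * prod_i P(A_i = a_i | c)."""
    m = sum(class_counts.values())

    def joint(c):
        p = class_counts[c] / m                                  # data prior m_c / m
        for i, value in enumerate(x):
            p *= value_counts[(i, c)][value] / class_counts[c]   # fraction within class c
        return p

    return max(class_counts, key=joint)


examples = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "hot"), "yes"),
    (("sunny", "mild"), "yes"),
]
model = train(examples)
first = predict(("rainy", "mild"), *model)
second = predict(("sunny", "hot"), *model)
```

Since the denominator $\Pr(x)$ is the same for every class, the sketch never computes it and compares the unnormalised products directly.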

Unless discretized, continuous values are harder to take care of and require a 
different strategy. It is common to assume that within each class $c$ the values of 
numeric attributes are normally distributed. Then by estimating from the training 
set the mean $\mu_c$ and standard deviation $\sigma_c$ of the continuous attribute given 
$c$, one can compute the probability of the observed value. After obtaining $\mu_c$ and 
$\sigma_c$ for an attribute $A_i$ the estimation boils down to calculating the probability 
density function for a Gaussian distribution: 

$$P(A_i = a_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_c} \exp\left(-\frac{(a_i - \mu_c)^2}{2\sigma_c^2}\right).$$

Using a Dirichlet prior, or more generally Bayesian estimation methods [17,4], 
and kernel density estimation [15] are some alternatives to the straightforward 
normality assumption for estimating a model for the distribution of the continuous 
attribute. In this paper, however, we are only concerned with probability 
estimates computed as data priors $P(\cdot)$. 

Despite the unrealistic attribute independence assumption underlying Naive 
Bayes it is a very successful classifier in practical situations. Some explanations 
have been offered by Domingos and Pazzani [5], who showed that Naive Bayes 
may be globally optimal even though the attribute independence assumption is 
violated. It was shown that, under 0-1 loss, Naive Bayes is globally optimal for 
the concept classes of conjunctions and disjunctions of literals. Gama [14] discusses 
Naive Bayes and quadratic loss. 

It is well-known that the naive Bayesian classifier is equivalent to a linear ma- 
chine and, hence, for nominal attributes its decision boundary is a hyperplane 
[7,21,5]. Thus, Naive Bayes can only be globally optimal for linearly separa- 
ble concept classes. Ling and Zhang [20] consider the representational power of 
Naive Bayes and more general Bayesian networks further. They characterize the 
representational power through the maximum XOR contained in a function. 



3 Discretizing Continuous Attributes 

The dominating discretization techniques for continuous attributes in Naive 
Bayes are unsupervised equal-width binning [24] and the greedy top-down approach 
of Fayyad and Irani [13]. These straightforward heuristic approaches 
have also been given some analytical backing [4]. However, for other classifier 
learners, decision trees in particular, the analysis of discretization has been 
taken much further. In the following we recapitulate briefly the line of analysis 

[Figure 1, four panels from top to bottom: "Data is sorted by an attribute value, classes recorded"; 
"Bins are separated by cut point candidates"; "Blocks are separated by boundary points"; 
"Segments have different relative class distributions".]

Fig. 1. The original set of 27 examples (top) can only be partitioned at bin borders 
(second from top) where the value of the attribute changes. Class uniform bins can be 
combined into blocks (second from bottom). Block borders are the boundary points of 
the numerical range. Furthermore, partitions can only happen between blocks with different 
relative class distribution. Thus, we may arrange the data into segments (below). 
Segment borders are a subset of boundary points. 



initiated by Fayyad and Irani [12]. The goal is to reduce the number of examined 
cut points without losing the possibility to recover optimal partitions. 

In decision tree learning the processing of a numerical value range usually 
starts with sorting of the data points [1,22]. If each data point in the sorted 
sequence could be given its own partition interval, this discretization would 
have zero training error. However, only those points that differ in their value can 
be separated from each other. Therefore, we can preprocess the data into bins, 
one bin for each existing data point value. Within each bin we record the class 
distribution of the instances that belong to it (see Fig. 1). The class distribution 
information suffices to evaluate the goodness of the partition; the actual data 
set does not need to be maintained. 

The sequence of bins attains the minimal misclassification rate. However, 
the same rate can usually be obtained with a smaller number of intervals. The 
analysis of the entropy function by Fayyad and Irani [12] has shown that cut 
points embedded into class-uniform intervals need not be taken into account, only 
the end points of such intervals — the boundary points — need to be considered 
to find the optimal discretization. Elomaa and Rousu [8] showed that the same 
is true for several commonly-used evaluation functions. 

Subsequently, a more general property was also proved for some evaluation 
functions [9]: segment borders — points that lie in between two adjacent bins with 
different relative class distributions — are the only points that need to be taken 
into account. It is easy to see that segment borders are a subset of boundary 
points. 

For strictly convex evaluation functions it was shown later that examining 
segment borders is necessary as well as sufficient in order to be able to dis- 
cover the optimal partition. For Training Set Error, which is not strictly convex, 
it suffices to only examine a subset of the segment borders. These points are 
called alternations and they are placed on segment borders where the frequency 
ordering of the classes changes [10,11]. 

These analyses can be used in preprocessing in a straightforward manner: 
we merge together, in linear time, adjacent class uniform bins with the same 
class label to obtain example blocks (see Fig. 1). The boundary points of the 
value range are the borders of its blocks. Example segments are easily obtained 
from bins by comparing the relative class distributions of adjacent bins (see Fig. 
1). This can be accomplished on the same left-to-right scan that is required to 
identify bins. Also alternations can be detected during the same scan. 
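
The preprocessing into bins, blocks and segments can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, bins are represented as pairs of a starting value and a class-count `Counter`, and equality of relative class distributions is tested by integer cross-multiplication to avoid rounding issues.

```python
from collections import Counter


def make_bins(values, labels):
    """Sort by attribute value and build one bin per distinct value."""
    bins = []
    for value, label in sorted(zip(values, labels), key=lambda pair: pair[0]):
        if bins and bins[-1][0] == value:
            bins[-1][1][label] += 1
        else:
            bins.append((value, Counter([label])))
    return bins


def same_distribution(a, b):
    """True if two class-count tables have equal relative class distributions."""
    na, nb = sum(a.values()), sum(b.values())
    return all(a[c] * nb == b[c] * na for c in set(a) | set(b))


def same_uniform_class(a, b):
    """True if both tables are class uniform and carry the same class label."""
    return len(a) == 1 and len(b) == 1 and set(a) == set(b)


def merge(bins, mergeable):
    """One left-to-right scan that merges adjacent bins accepted by mergeable."""
    merged = []
    for value, counts in bins:
        if merged and mergeable(merged[-1][1], counts):
            merged[-1] = (merged[-1][0], merged[-1][1] + counts)
        else:
            merged.append((value, Counter(counts)))
    return merged


values = [1, 1, 2, 3, 3, 4, 4, 4, 4, 5]
labels = ["a", "a", "a", "a", "b", "a", "a", "b", "b", "b"]
bins = make_bins(values, labels)
blocks = merge(bins, same_uniform_class)     # borders are the boundary points
segments = merge(bins, same_distribution)    # borders are the segment borders
```

In the toy data the five bins collapse into four blocks and three segments, which reflects that segment borders are a subset of boundary points. Apart from the initial sort, each step is a single linear scan.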

4 Decision Boundaries of Naive Bayes 

We show now that segment borders in the domain of a continuous attribute are the locations 
where Naive Bayes changes its class prediction, i.e., its decision boundaries. 
We start from undiscretized domains, go on to error-minimizing discretizations, 
and finally consider optimal partitions with respect to several attributes. 

4.1 Decision Boundaries for Undiscretized Attributes 

We start by examining the decision boundaries that Naive Bayes sets when the 
continuous attribute is not discretized, but each numerical value is treated sep- 
arately. When we consider the decision boundaries from the point of view of one 
attribute, we assume an arbitrary fixed value setting for the other attributes. In 
the following, notation $P(\cdot)$ is used to denote the probability estimates computed 
by Naive Bayes, to distinguish them from true probabilities. 

Theorem 1. The decision boundaries of the naive Bayesian classifier are situated 
on segment borders. 

Proof. Let $A_i$ be a numerical attribute and let $V'$ and $V''$ be two adjacent 
intervals in its range separated by cut point $v_i$. For any other attribute $A_j$, 
$i \neq j$, let $V_j$ be an arbitrary subset of the values of $A_j$. Let us denote by 

$$T = V_1 \times \cdots \times V_{i-1} \times V_{i+1} \times \cdots \times V_n$$

the Cartesian product of these subsets. We assume that the prediction of the naive 
Bayesian classifier within $V' \times T$ is $c' \in C$, and the prediction within $V'' \times T$ 
is $c'' \in C$. In other words, looking at the situation only from the point of view 
of $A_i$ and taking all other attributes to have an arbitrary (but fixed) value 
combination, the decision boundary is set between intervals $V'$ and $V''$. Then 

$$P(c' \mid V' \times T) = P(c')\,P(V' \mid c') \prod_{j \neq i} P(V_j \mid c') > P(c'')\,P(V' \mid c'') \prod_{j \neq i} P(V_j \mid c'') = P(c'' \mid V' \times T).$$

By reorganizing the middle inequality we get 

$$\frac{P(V' \mid c')}{P(V' \mid c'')} > \frac{P(c'') \prod_{j \neq i} P(V_j \mid c'')}{P(c') \prod_{j \neq i} P(V_j \mid c')}. \qquad (2)$$

On the other hand, within $V'' \times T$ we obtain, by similar manipulation, 

$$\frac{P(V'' \mid c')}{P(V'' \mid c'')} < \frac{P(c'') \prod_{j \neq i} P(V_j \mid c'')}{P(c') \prod_{j \neq i} P(V_j \mid c')}. \qquad (3)$$

Put together, (2) and (3) imply 

$$\frac{P(V'' \mid c')}{P(V'' \mid c'')} < \frac{P(c'') \prod_{j \neq i} P(V_j \mid c'')}{P(c') \prod_{j \neq i} P(V_j \mid c')} < \frac{P(V' \mid c')}{P(V' \mid c'')}.$$

By applying the Bayes rule to the conditional probabilities and canceling out equal 
factors we get 

$$\frac{P(c' \mid V'')}{P(c'' \mid V'')} < \frac{P(c' \mid V')}{P(c'' \mid V')}.$$

Hence, the relative class distributions must be strictly different within the intervals 
$V'$ and $V''$, making $v_i$ thus a segment border. 

The above result does not, of course, mean that all segment borders would 
be places for class prediction change. However, the class prediction changes of an 
undiscretized domain are confined to segment borders. Consequently, no loss is 
incurred in grouping the examples in segments of equal class distribution. On the 
contrary, we expect to benefit from the more accurate probability estimation. 
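
A small numeric sketch of this consequence follows. It assumes plain frequency estimates and summarizes the fixed context $T$ by one likelihood value per class; the function name and the counts are hypothetical. Two intervals with the same relative class distribution receive the same prediction whatever the context, so merging them loses nothing.

```python
from typing import Sequence


def nb_prediction(interval_counts: Sequence[int],
                  class_totals: Sequence[int],
                  context_likelihood: Sequence[float]) -> int:
    """Return argmax_c P(c) * P(V | c) * P(T | c) using frequency estimates.

    interval_counts[c]    -- examples of class c falling in interval V
    class_totals[c]       -- examples of class c in the whole sample
    context_likelihood[c] -- assumed value of P(T | c) for the fixed context T
    """
    n = sum(class_totals)
    scores = []
    for c, total in enumerate(class_totals):
        prior = total / n
        likelihood = interval_counts[c] / total
        scores.append(prior * likelihood * context_likelihood[c])
    return max(range(len(scores)), key=scores.__getitem__)
```

For instance, the count vectors `[2, 4]` and `[1, 2]` describe intervals inside one segment, and `nb_prediction` returns the same class for both under any choice of `context_likelihood`.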

4.2 Decision Boundaries in k-Interval Discretization 

Let us now turn to the case where the continuous range has been discretized 
into k intervals. We will prove that in this case too segment borders are the only 
potential points for the decision boundaries. 

The following proof has the same setting as the proofs in connection with 
decision trees [9]. The sample contains three subsets, P, Q, and R, with class 
frequency distributions 



\[ p = \sum_{j=1}^{m} p_j, \quad q = \sum_{j=1}^{m} q_j, \quad \text{and} \quad r = \sum_{j=1}^{m} r_j, \]

where $p$ is the number of examples in $P$ and $p_j$ is the number of instances of 
class $j$ in $P$. Furthermore, $m$ is the number of classes. The notation is similar 
also for $Q$ and $R$. 

We consider the k-ary partition $\{S_1, \ldots, S_k\}$ of the sample, where subsets 
$S_h$ and $S_{h+1}$, $1 \le h \le k-1$, consist of the set $P \cup Q \cup R$, so that the split point 

[Figure 2: intervals $S_1, \ldots, S_{h-1}$, $S_h(\ell)$, $S_{h+1}(\ell)$, $S_{h+2}, \ldots, S_k$] 




Fig. 2. The following proofs consider partitioning of the example set $P \cup Q \cup R$ into 
two subsets $S_h$ and $S_{h+1}$ within $Q$. No matter where, within $Q$, the cut point is placed, 
equal class distributions result 



is inside $Q$, on the border of $P$ and $Q$, or that of $Q$ and $R$ (see Fig. 2). Let $\ell$ be 
a real value, $0 \le \ell \le q$.¹ Let $S_h(\ell)$ denote partition interval $S_h$ that contains 
$P$ and the $\ell$ first examples from $Q$. In the same situation $S_{h+1}(\ell)$ denotes the 
interval $S_{h+1}$. We assume that splitting the set $Q$ so that $\ell$ examples belong to 
$S_h(\ell)$ and $q - \ell$ to $S_{h+1}(\ell)$ results in identical class frequency distributions for 
both subsets of $Q$ regardless of the value of $\ell$. 

Let T again be the Cartesian product of the (arbitrary) subsets in dimensions 
other than the one under consideration. In this setting we can prove the following 
result which will be put to use later. 

Lemma 1. The sum $\max_{c \in C} P(c \cap S_h(\ell) \times T) + \max_{c \in C} P(c \cap S_{h+1}(\ell) \times T)$ is 
convex over $\ell \in [0, q]$. 

Proof. Let $\ell_1, \ldots, \ell_r$ be the class prediction change points within $[0, q]$. 
Without loss of generality, let us denote by $c_i$, $1 \le i \le r-1$, the class predicted within 
$S_h(\ell)$, $\ell \in \,]\ell_i, \ell_{i+1}]$. The probability of instances of class $c$ within $S_h(\ell) \times T$ can 
be expressed as 

\[ P(c \cap S_h(\ell) \times T) = \frac{p_c \cdot P(T \mid c)}{n} + \ell \cdot \frac{(q_c/q) \cdot P(T \mid c)}{n}, \]

which describes a line with offset $(p_c/n)\,P(T \mid c)$ and slope $((q_c/q)/n)\,P(T \mid c)$ 
(see Fig. 3). Now, it must be that the offsets satisfy 

\[ \frac{p_{c_1} \cdot P(T \mid c_1)}{n} \ge \cdots \ge \frac{p_{c_{r-1}} \cdot P(T \mid c_{r-1})}{n} \]

and the slopes of the lines satisfy 

\[ \frac{(q_{c_1}/q) \cdot P(T \mid c_1)}{n} \le \cdots \le \frac{(q_{c_{r-1}}/q) \cdot P(T \mid c_{r-1})}{n}. \]

Interpreting the situation geometrically, we see that $\max_c P(c \cap S_h(\ell) \times T)$ 
forms a convex curve (Fig. 3). By symmetry, $\max_c P(c \cap S_{h+1}(\ell) \times T)$ also is 
convex, and the claim follows by the convexity of the sum of convex functions. 



¹ No harm is done in considering splitting $Q$ at points other than those corresponding to 
an integral number of examples, since we are proving the absence of local extrema. 

Fig. 3. The maxima of the sum of the most probable classes in $S_h(\ell)$ and $S_{h+1}(\ell)$ forms 
a convex curve over $[0, q]$ 



The following proof shows that a cut point in between two adjacent subsets 
$S_h$ and $S_{h+1}$ in one dimension is on a segment border, regardless of the context 
induced by the other attributes. Due to the additivity of the error, the result 
also holds in the multisplitting case, where a number of cut points are chosen in 
each dimension. 

Theorem 2. The error-minimizing cut points of Naive Bayes are located on 
segment borders. 

Proof. Let $c_L(\ell) = \arg\max_{c \in C} P(c \cap S_h(\ell) \times T)$ be the most probable class for 
$S_h(\ell)$ according to Naive Bayes criterion in this situation. In other words, 

\[ P(c_L(\ell) \cap S_h(\ell) \times T) = \max_{c \in C} P(c \cap S_h(\ell))\, P(T \mid c). \]

Similarly, let $c_R(\ell)$ denote the most probable class in $S_{h+1}(\ell)$. 

The minimum-error partition is the one that has the smallest combined error 
in the subsets $S_h(\ell)$ and $S_{h+1}(\ell)$. Thus, we want to optimize 

\[ \min_{\ell \in [0,q]} \Big( P(S_h(\ell) \times T) - P(c_L(\ell) \cap S_h(\ell) \times T) 
   + P(S_{h+1}(\ell) \times T) - P(c_R(\ell) \cap S_{h+1}(\ell) \times T) \Big), \]

which is equal to 

\[ \min_{\ell \in [0,q]} \Big( P(S \times T) - \big( P(c_L(\ell) \cap S_h(\ell) \times T) + P(c_R(\ell) \cap S_{h+1}(\ell) \times T) \big) \Big), \]

where $S = P \cup Q \cup R$. By Lemma 1 this is a concave function of $\ell \in [0, q]$. Hence, 
it minimizes at one of the extreme values of $\ell$, which are the locations of the 
segment borders. Thus, we have proved the claim. 
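
The argument can be checked numerically with a small sketch. The code below is an illustration under stated assumptions: the function name is hypothetical, the context is reduced to one weight `w[c]` standing in for $P(T \mid c)$, and exact rational arithmetic is used so that comparisons are not affected by rounding. It evaluates the combined error for a cut placed after the first `ell` examples of $Q$.

```python
from fractions import Fraction
from typing import Sequence


def combined_error(ell: int, p: Sequence[int], q: Sequence[int],
                   r: Sequence[int], w: Sequence[int]) -> Fraction:
    """Combined error of S_h(ell) and S_{h+1}(ell) for a cut inside Q.

    p, q, r hold per-class counts of the subsets P, Q and R; w[c] is an
    assumed stand-in for P(T | c). Q is taken to have the same class
    distribution wherever it is cut, as in the setting of the proof.
    """
    classes = range(len(p))
    n = sum(p) + sum(q) + sum(r)
    q_total = sum(q)
    # Mass of the most probable class on each side of the cut.
    left = max((p[c] + ell * Fraction(q[c], q_total)) * w[c] for c in classes)
    right = max((r[c] + (q_total - ell) * Fraction(q[c], q_total)) * w[c]
                for c in classes)
    total = sum((p[c] + q[c] + r[c]) * w[c] for c in classes)
    return (total - left - right) / n
```

With `p = [1, 2]`, `q = [4, 2]`, `r = [3, 1]` and unit weights, the error over `ell = 0, ..., 6` is never below the smaller of its two endpoint values, as concavity requires.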

In principle it might be possible to reduce the number of examined points by 
leaving some segment borders without attention. Can we identify such a subset 
efficiently? 






In the univariate setting the answer is affirmative: Only the set of alternation 
points needs to be considered. These points are those in between adjacent bins 
$V'$ and $V''$ with a conflict in the frequency ordering of the classes: $P(c \mid V') < 
P(c' \mid V')$ and $P(c' \mid V'') < P(c \mid V'')$. This is a direct consequence of the fact 
that training error minimizes on such points. The set of alternation points can 
be found in linear time, so it can speed up the discretization process [10]. 
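
The alternation test between two adjacent bins can be sketched as below. The function name and the count-vector representation are assumptions for illustration. Within a single bin the ordering of $P(c \mid V)$ is the ordering of the raw counts, so no normalization is needed.

```python
from typing import Sequence


def is_alternation_point(left: Sequence[int], right: Sequence[int]) -> bool:
    """True if some class pair is ordered oppositely in the two adjacent bins.

    left[c] and right[c] are the numbers of class-c examples in the bins
    on either side of the candidate cut point.
    """
    k = len(left)
    for c in range(k):
        for d in range(k):
            # P(c | V') < P(d | V') and P(d | V'') < P(c | V'')
            if left[c] < left[d] and right[d] < right[c]:
                return True
    return False
```

Applying the test to every pair of adjacent bins during the left-to-right scan yields the alternation points; the pairwise loop costs a factor quadratic in the number of classes, which is constant for a fixed class set.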

In multivariate setting, the other attributes need to be taken into account. 
Let $V' \times T$ and $V'' \times T$ be two adjacent hyperrectangles. One can show that 
there is no decision boundary in between them if for all class pairs $c$ and $c'$ we 
have either 

1. $P(c' \mid V' \times T) < P(c \mid V' \times T)$ and $P(c' \mid V'' \times T) < P(c \mid V'' \times T)$, or 

2. $P(c' \mid V' \times T) > P(c \mid V' \times T)$ and $P(c' \mid V'' \times T) > P(c \mid V'' \times T)$. 

The problem with using this criterion to prune the set of candidate cut points 
is that the definition depends on the context T, and there is an exponential 
number of such contexts. So, even if all segment borders are not useful, deciding 
which of them can be discarded seems difficult. 

Thus, in practice, finding a linear-time preprocessing scheme to reduce the 
set of potential cut points to a proper subset of segment borders is difficult. 

4.3 Decision Boundaries of Naive Bayes in Multiple Dimensions 

It is well known that in the discrete (two-class) case the decision boundary is a 
(single) hyperplane in the input space [7,21]. In case of continuous attributes the 
situation is much more difficult: The decision regions and their boundaries may 
have arbitrary shape [7]. However, from preceding results we know that decision 
boundaries in reality can only occur at segment borders of each continuous at- 
tribute. Therefore, we actually can consider discretized ranges instead of truly 
continuous attributes. 

In Fig. 4 the example set of Fig. 1 has been augmented with another (arbi- 
trary) dimension. The segments of these two dimensions divide the input space 
(a plane) into a 6 x 5 grid, where each grid cell gets assigned a class label. Class 
uniform rows and columns get a uniform labeling but otherwise one cannot de- 
termine the labeling of grid cells based only on one dimension. Values of both 
attributes are needed to determine the class label. For example, when the at- 
tribute depicted on the y-axis of Fig. 4 has a value in its last segment, depending 
on the value of the attribute along the x-axis, there are two segments where the 
most probable prediction would be d and two segments where it would be e. 

In general the discretized input space is divided into hyperrectangular cells, 
each assigned the class label according to the relevant segment statistics. 
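
The labeling of grid cells can be sketched for two attributes as follows. This is an illustrative sketch with hypothetical names and counts, assuming plain frequency estimates: each cell gets the class maximizing $P(c)\,P(V_x \mid c)\,P(V_y \mid c)$, computed exactly with rational numbers.

```python
from fractions import Fraction
from typing import List, Sequence


def label_grid(class_totals: Sequence[int],
               x_segments: Sequence[Sequence[int]],
               y_segments: Sequence[Sequence[int]]) -> List[List[int]]:
    """Assign the Naive Bayes class label to every cell of the segment grid.

    x_segments[i][c] is the number of class-c examples in the i-th segment
    of the first attribute (likewise y_segments for the second attribute);
    every class is assumed to have at least one example.
    """
    n = sum(class_totals)
    grid = []
    for y_counts in y_segments:
        row = []
        for x_counts in x_segments:
            scores = [
                Fraction(t, n) * Fraction(x_counts[c], t) * Fraction(y_counts[c], t)
                for c, t in enumerate(class_totals)
            ]
            row.append(max(range(len(scores)), key=scores.__getitem__))
        grid.append(row)
    return grid
```

In the call `label_grid([6, 6], [[4, 1], [2, 2], [0, 3]], [[4, 2], [2, 4]])` the middle column changes its label between the two rows, so the label of those cells cannot be read off one dimension alone.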

4.4 On Finding Optimal Discretizations for Naive Bayes 

Fig. 4. The segments of two continuous attributes divide the input space into a 
rectangular grid. All grid cells are assigned the class label determined by the residual sums 
of the corresponding segments 

Theorem 2 tells us that the decision boundaries of Naive Bayes are always 
located on segment borders, which makes it possible to preprocess the data into 
segments prior to discretization. By this result, one can use the same linear-time 
optimization algorithm to find the univariate Naive Bayes optimal multisplits as 
in the case of decision trees [9,11]. This so-called Auer-Birkendorf algorithm is 
based on dynamic programming. During a left-to-right scan over the segments, 
one can maintain the information required to decide into how many intervals 
the data should be split and where to locate the interval borders to obtain as 
good a value as possible for the partition. 
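
To make the dynamic-programming idea concrete, here is a deliberately simple sketch. It is not the linear-time Auer-Birkendorf algorithm referred to above: it is a plain quadratic-time recurrence over segment borders, with a hypothetical function name, that returns the smallest training error achievable with at most `k` intervals.

```python
from typing import List, Sequence


def min_error_partition(segments: Sequence[Sequence[int]], k: int) -> int:
    """Minimum training error of a partition of the segments into <= k intervals.

    segments[i][c] is the number of class-c examples in the i-th segment.
    Each interval predicts its majority class; cuts fall on segment borders.
    """
    m = len(segments)
    num_classes = len(segments[0])
    # prefix[j][c] = number of class-c examples in the first j segments
    prefix: List[List[int]] = [[0] * num_classes]
    for counts in segments:
        prefix.append([a + b for a, b in zip(prefix[-1], counts)])

    def interval_error(i: int, j: int) -> int:
        """Misclassified examples when segments i..j-1 form one interval."""
        counts = [prefix[j][c] - prefix[i][c] for c in range(num_classes)]
        return sum(counts) - max(counts)

    inf = float("inf")
    # best[parts][j] = least error splitting the first j segments into `parts` intervals
    best = [[inf] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for parts in range(1, k + 1):
        for j in range(1, m + 1):
            for i in range(parts - 1, j):
                candidate = best[parts - 1][i] + interval_error(i, j)
                if candidate < best[parts][j]:
                    best[parts][j] = candidate
    return min(best[parts][m] for parts in range(1, k + 1))
```

For the three segments `[[3, 0], [3, 3], [0, 3]]` a single interval misclassifies 6 examples, while two or three intervals bring the error down to 3.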

However, the situation in the multivariate setting is much more difficult. Even 
with the data preprocessed into segments we may still have a daunting amount 
of possible discretizations: $O(2^T)$ to be exact, where $T = \sum_i T_i$ and $T_i$ is the 
number of cut point candidates along the i-th dimension. Could there, 
nevertheless, exist an efficient algorithm for optimal discretization? Unfortunately, we 
have to answer in the negative, as shown by the next theorem [23]. 

Theorem 3. Finding the Naive Bayes optimal discretization of the real plane 
is NP-complete. 

This can be proved by a reduction from Minimum Set Cover using a construction 
similar to the one Chlebus and Nguyen [3] used to show that even optimal consistent 
splitting of the real plane is NP-complete. We construct a configuration of 
points in the 2D plane corresponding to the set covering instance and show two 
properties [23]: 

1. The plane can be consistently discretized with k cut lines if and only if there 
is a set cover of size k for the given set cover instance. 

2. The optimal Naive Bayes discretization coincides with the consistent 
discretization. 

Since the hypothesis class of Naive Bayes is the set of product distributions 
of marginal likelihoods, the above theorem strengthens the negative result of 
Chlebus and Nguyen [3], which holds for general axis-parallel partitions of 
the real plane. 
Observe that the result easily generalizes to cases with more than two dimensions 
by embedding the 2D plane corresponding to the set covering instance into the 
higher dimensional space. The problem remains equally hard when there are 
more than two classes. 

From the point of view of finding the optimal multivariate splits, exhaustive 
search over the segment borders of all dimensions is the remaining possibility for 
optimization, which becomes prohibitively time-consuming on larger datasets. 

5 Conclusion 

Examining segment borders is necessary and sufficient in searching for the 
optimal partition of a value range with respect to a strictly convex evaluation 
function [11]. The same set of cut point candidates is relevant for Naive Bayes: Its 
decision boundaries (in disjoint partitioning) fall exactly on segment borders. 

On the other hand, it seems that for an algorithm to rule out some segment 
borders from among the decision boundary candidates, it would have to examine 
too many contexts to be efficient. Therefore, preprocessing the value ranges of 
continuous attributes into segments appears necessary if one wants to detect all 
class prediction changes. Such preprocessing, naturally, is sufficient. 

As future work we leave the empirical evaluation of the usefulness of seg- 
ment borders and their accuracy in probability estimation as well as studying 
possibilities to approximate optimal multivariate discretization and the utility 
of segment borders therein. 

Abstract. Developments in physical and biological technology have 
resulted in a rapid rise in the amount of data available on the 3D structure 
of protein-ligand complexes. The extraction of knowledge from this data 
is central to the design of new drugs. We extended the application of 
Inductive Logic Programming (ILP) in drug design to deal with such 
structure-based drug design (SBDD) problems. We first expanded the 
ILP pharmacophore representation to deal with protein active sites. 
Applying a combination of the ILP algorithm Aleph, and linear regression, 
we then formed quantitative models that can be interpreted 
chemically. We applied this approach to two test cases: Glycogen 
Phosphorylase inhibitors, and HIV protease inhibitors. In both cases we observed a 
significant (P < 0.05) improvement over both standard approaches, and 
use of only the ligand. We demonstrate that the theories produced are 
consistent with the existing chemical literature. 



1 Introduction 

Most drugs are small molecules (ligands) that bind to proteins [19]. When knowl- 
edge of the 3D structure of the target protein is used in the drug design pro- 
cess, the term structure-based drug design (SBDD) is used. Knowledge of the 
co-crystallized protein-ligand complex structure is particularly important as it 
shows how a drug interacts with its target. The binding of the ligand to its 
target can be regarded as a key (ligand) fitting a lock (active site) (figure 1). 
To ensure this complementarity, a potential candidate must be the right size 
for the binding site, must have the correct binding groups to form a variety of 
weak interactions and must have these binding groups correctly positioned to 
maximize such interactions. These interactions are primarily hydrophobic and 
electrostatic (hydrogen bonds, interactions between groups of opposite charges). 
They are individually weak but, if sufficient in number, lead to a strong 
overall interaction (binding energy) enabling the ligand to bind to the target site 
(also referred to as activity). These general principles of drug interactions are now 
well understood, but specific relations between molecular structure and function 
are still too complex to be delineated from physico-chemical theory, and 
semi-empirical approaches are necessary. From the computational side [14], SBDD 
involves two main sub-problems to design new active compounds: the prediction of 

Fig. 1. Schematic representation of a ligand binding a protein illustrating the 
complementarity of shape and property (left). Example of a three-element pharmacophore 
(right) derived from the 3D structure of the ligand in the known ligand-protein complex 
(left). 



the most likely ligand mode binding conformation (docking) and the estimation 
of the relative binding energy of a protein-ligand complex (scoring) [9] . 

The Protein Data Bank (PDB)[3] is the single worldwide repository for the 
processing and distribution of 3D biological macromolecular structure data and 
the number of co-crystallized protein-ligand complexes is rising exponentially 
over the years. The state-of-the-art in SBDD is to use general propositional 
regression functions that are designed to be applicable to any active site (although 
parameterized using only a small subset of the PDB). Predictions are not 
generally tuned for specific active sites [20]. Here we describe an Inductive Logic 
Programming (ILP) / Relational Data Mining (RDM) approach for SBDD based 
on generalizing over examples of ligands bound to a specific active site. 

The structural nature of many chemical structure-function/property 
relationships has proven to be well suited to Inductive Logic Programming (ILP) 
[17]. We take the name of ILP to generalise all work in ILP and the related field 
of Relational Data Mining. In drug design, ILP has been successfully applied to 
model structure-activity relationships (SAR). Here the task was to obtain rules 
that could predict biological activity or toxicity of compounds from their 
chemical structure [12,16]. ILP is based on logical relations and differs from standard 
chemoinformatics approaches that use attributes (molecular descriptor, 
molecular field, etc.) to encode the chemical information. For such problems, logic 
provides a unified way of representing the relations between objects (atoms and 
bonds). ILP systems have progressively been shown to be capable of handling 
1 [12], 2 [16] and 3 dimensional [8, 21] descriptions of the molecular structures, 
allowing the development of compact and comprehensible theories. Moreover, ILP 
has achieved the same predictive power as, or has significantly improved on, the 
traditional QSAR (Quantitative SAR) built using standard propositional learners 
and statistical methods [15, 16]. 

We take the next natural step in developing the ILP approach to drug design 
by extending it to SBDD. The aim of this study is four-fold: 

— to explore how best to represent the relationship between ligand and protein 
and how to adapt the ILP tools to suit our study. 

— to test whether ILP can form accurate quantitative models of the binding 
energy of ligands. 

— to compare the ILP results with conventional 3D QSAR and SBDD pro- 
grams. 

— to examine the insight obtained from the ILP rules. 

2 Methods and Materials 

This section describes the complete process we employed to address our problem. 
The methodology adopted for this study is organized as follows: 1) collect 3D 
structural data from the PDB and their corresponding biological activities in 
the literature; 2) transform the molecular structures into facts from a molecular 
modelling package and extract the features of interest to build the background 
knowledge; 3) form 3D structural features (pharmacophores) using ILP; 4) form 
regression models using the pharmacophores and assess their predictive power. 

2.1 Datasets 

A complete description of the protein-ligand series is reported in section 3. While 
the PDB gathers most of the structural data of biomolecular systems, there is no 
unified way to distribute biological activities and structures directly to analysis 
methods. A preprocessing step is necessary to clean the PDB files: isolation of 
the ligand, addition of missing atoms or residues, removal of useless information, 
etc. Despite the fact that the way ILP encodes chemical information is less 
sensitive to the initial preparation of the complexes than other SBDD methods 
(protonation state for example), extra care was required to form the proper 
assignment of the atom types before building the Datalog program. 

2.2 Background Knowledge and Its Representation 

ILP systems use background knowledge to further describe problems. The 
background knowledge comprises our statements about the most relevant features 
to explain the biological activity. This mainly involves using the most 
comprehensive and the most declarative representation to encode domain-dependent 
information. The content of the background knowledge used for this study is 
illustrated in figure 2. 

In our representation, the three dimensional information is expressed in terms 
of distances between atoms or structural groups (building blocks) giving the 
final rule a pharmacophore-like form. The concepts of (3D) pharmacophore and 
pharmacophore elements are very important in medicinal chemistry: a 
pharmacophore is an arrangement of atoms or groups of atoms which influence 
drastically the activity at a target receptor [19]. Pharmacophore representation 
expresses the potential activity in a language familiar to medicinal chemists and 

Predicates related to the ligand (all arity 3): 
hacc, hdon, alcohol, equiv_ether, six_ring, hetero_non_ar_6_ring, amide, 
carbonyl, amine_0h, methyl, lipo_seg, ar_6c_ring, halogen, five_ring. 

Predicates related to the protein (all arity 4): 
prot_backc2, prot_cooh, prot_alcohol, prot_negcharge, prot_poscharge, 
prot_amide, prot_guadinium, prot_lipo_seg. 

Hydrogen bonding predicate: hb/4. 

Water position predicate: water/3. 

Fig. 2. General chemical knowledge defined in the background knowledge. 



is easily convertible for searching compounds in chemical databases. A pharma- 
cophore usually refers to the ligand only but, in the following, we apply this 
definition to the active site as well. 

The Prolog implementation requires facts that store the location of particular 
groups and a predicate dist/4 which states the Euclidean distance in 3D space 
between two groups. For example, the following conjunction, 

hdonor(110,1,A), methyl(110,1,B), dist(A,B,6.3,1.0) 

represents the fact that in the compound 110 in its conformation labelled 1, 
there are a hydrogen bond donor A and a methyl group B separated by 6.3 ± 1.0 
Angstroms. 
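
For readers less familiar with Prolog, the meaning of such a conjunction can be sketched in Python. The sketch is illustrative only and is not part of the system described here: the function names, the dictionary layout and the coordinates are hypothetical, and a group is reduced to a single 3D point.

```python
import math
from itertools import product
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def dist_holds(a: Point, b: Point, target: float, tolerance: float) -> bool:
    """Analogue of dist(A, B, target, tolerance): the Euclidean distance
    between the two groups lies within target +/- tolerance."""
    return abs(math.dist(a, b) - target) <= tolerance


def conformation_matches(groups: Dict[str, List[Point]],
                         target: float = 6.3,
                         tolerance: float = 1.0) -> bool:
    """True if one conformation contains a hydrogen bond donor and a methyl
    group whose separation satisfies the distance constraint."""
    donors = groups.get("hdonor", [])
    methyls = groups.get("methyl", [])
    return any(dist_holds(a, b, target, tolerance)
               for a, b in product(donors, methyls))
```

Only distances between groups within one conformation are used, so no common coordinate frame across compounds is required.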

Pharmacophore mapping with ILP avoids the need of traditional 3D QSAR 
and pharmacophore learning methods to prealign and superpose all the ligands 
to a common extrinsic coordinate system. The requirement is forced by the 
propositional nature of the traditional approaches [19]. ILP has the advantage 
that it can directly use the intrinsic coordinate system of each complex. 

Some ligands may also have more than one conformation (3D structure). This 
is the problem which first highlighted the multiple instance problem, and most 
propositional machine learning algorithms require major changes to deal with 
it [13]. ILP has the advantage that it can naturally deal with multiple instance 
problems. 

Only a brief summary of the predicates used for this study is presented here. 
A Prolog example of generating building-block facts from molecular structure is 
illustrated in [21]. The pharmacophore elements available for the present SAR 
analysis (figure 2) can be divided into the following two categories: 

— Ligand related predicates state the position in 3D space of simple or com- 
plex chemical groups providing, for example, the definition of methyl group 
or aromatic rings. They can also encode some important physico-chemical 
properties of the atoms or the building blocks, such as their ability to form 
hydrogen bonds. 

— The active site is described by integrating specific chemical knowledge 
related to a number of important amino acids and water molecules as well as 
representing hydrogen bonds (hb/4) explicitly. 

2.3 Constructing Theories with Aleph 

The learning algorithm used for this study is the ILP system Aleph [25]. This 
algorithm follows the classic ILP search engine framework [6]: given a background 
knowledge (i.e. relations describing the molecular structures), a set of examples 
(i.e. training data) and a language specification for hypotheses, an ILP system 
will attempt to find a set of rules that explain the examples using the background 
knowledge provided. We chose Aleph because it can be easily tuned to suit 
our learning system which proceeds by iterating through the following basic 
algorithm: 

— The training data is formed by dichotomising the data into two sets (positive 
and negative) based on their biological activity. Because there is not a natural 
cut-off to the predictor, an example is chosen from the training data and the 
positive set comprises the molecules with the closest activity (1/3 of the 
training data are used in this study). The rest of the examples (2/3 of the 
molecules) are considered as negative examples. 

— The most specific clause (bottom clause) that entails the above example is 
then constructed within the language restrictions provided [22]. This is known 
as the saturation step. The bottom clause prunes the search before it begins 
by identifying all the potential clauses explaining the activity of the selected 
molecule. 

— The search is a refinement graph search: it proceeds along the space of clauses 
(partially ordered by θ-subsumption) between the specific hypothesis (bottom 
clause) and the most general clause (empty body) [25]. We require a 
complete search in order to find all the possible pharmacophores consistent 
with the data. 

— The new clause is added to the theory and the search is repeated until 
all the examples are saturated once. Pharmacophores are, thus, learnt for 
both highly and less active compounds. This contrasts with the usual ILP 
framework where all examples made redundant are removed (cover removal 
step [25]). Our aim is to use the rules as indicator variables to build 
quantitative models and the compactness will be assured by the model rather than 
by the ILP process. 
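
The dichotomising step in the first item of the list above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: the function name is hypothetical, the chosen example is counted among the positives, and ties in activity distance are broken by index.

```python
from typing import List, Sequence, Tuple


def dichotomise(activities: Sequence[float], seed: int,
                fraction: float = 1 / 3) -> Tuple[List[int], List[int]]:
    """Split example indices into positives and negatives around a seed example.

    The positives are the `fraction` of examples whose activity is closest
    to that of the seed; all remaining examples are negatives.
    """
    k = max(1, round(len(activities) * fraction))
    by_closeness = sorted(
        range(len(activities)),
        key=lambda i: (abs(activities[i] - activities[seed]), i),
    )
    positives = sorted(by_closeness[:k])
    negatives = sorted(by_closeness[k:])
    return positives, negatives
```

Repeating the split with each training example as the seed gives one positive/negative labeling per saturation step.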



2.4 Building QSAR Models 

To combine the ILP pharmacophores into a regression model we used a vanilla 
in-house multiple linear regression program. The predictive power of the model was 
evaluated using leave-one-out cross-validation (involving the ILP and regression 
steps). The results are presented using the squared correlation coefficient ($r^2_{cv}$) 
between the actual and the predicted value of the activity. This is the standard 
measure in drug design. In the following, the activity is evaluated in logarithm 
units of the inhibition constant ($\log(1/K_i)$). 
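
The accuracy measure can be written down in a few lines. The sketch below computes the squared Pearson correlation between two equally long lists; the function name is hypothetical, and in the leave-one-out setting each entry of `predicted` is assumed to come from a model built without the corresponding compound.

```python
from typing import Sequence


def squared_correlation(actual: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Squared Pearson correlation between actual and predicted activities."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_a * var_p)
```

The function assumes that neither list is constant, since the correlation is undefined in that case.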

We compare the results of the ILP models with the use of two conventional 
drug design approaches, CoMFA and a SBDD scoring function. 

CoMFA (Comparative Molecular Field Analysis [5]) is the most commonly 
used 3D QSAR ligand-based approach [18]. The basic idea of CoMFA is to 
superimpose ligands onto a common 3D grid, and then sample their electronic 
structure at regular points (voxels). This has the benefit of transforming the 
data into a propositional form, but relies on the (often false) assumption that 
every molecule in the series interacts with the same target molecule and in the 
same way (common receptor assumption) [18]. It can also be difficult to know how 
best to superimpose molecules that do not share much common structure. CoMFA 
also has the drawback of producing thousands of correlated attributes, which 
requires the powerful PLS regression approach to avoid overfitting. In CoMFA, 
neighbouring voxel attributes are generally highly correlated, yet this 
information is thrown away. PLS can be used to partially regenerate this correlated 
structure. In the following, we present the CoMFA analysis using the observed 
ligand conformation in the protein-ligand complex (common receptor 
assumption) within an optimized molecular field (superposition/translation). 

As no general scoring function has been reported to date that is able to 
predict binding affinities with a high degree of accuracy [10], we present results 
with the most accurate approach, for each series under study, among five 
functions available in the CScore module of Sybyl [1] to compare models including 
information on the active site. 



3 Results and Discussion 

We report results obtained from our approach on two protein targets: the 
glycogen phosphorylase b (GP) and the human immunodeficiency virus protease 
(HIV-PR) enzymes. Chemical structures, inhibition data and predicted 
biological activities can be accessed from 

http://www.aber.ac.uk/compsci/Research/bio/dss/. 

We chose to study GP and HIV-PR because: a significant amount of 3D in- 
formation is available on them in the PDB, allowing an accurate validation of the 
method; they have already been extensively studied, giving us the opportunity 
to verify the meaning of the rules found by Aleph, and comparable published 
models; the two datasets stand at two extreme points in SBDD problems. The 
GP dataset is a homogeneous series of 3D structures with only slight 
modifications of the structure of ligands. This contrasts with the HIV-PR dataset 
where the structures of the inhibitor, and to a lesser extent the protein sequence, 
exhibit dramatic changes from one complex to the next. 

3.1 Glycogen Phosphorylase b 

The set of 51 co-crystallized inhibitors of the glycogen phosphorylase b has been 
taken from the same SBDD project [23]. In this case, the chemical structure of the 
GP inhibitors is homogeneous, thus meeting the usual requirements of traditional 
2D/3D QSAR (common receptor assumption). However, the CoMFA [5] analysis 
on the 51 inhibitors leads to a poor predictive power ($r^2_{cv}$=0.46, table 1). One 
would have thought that we should have been able to derive more physical 
properties characterising ligand-receptor interaction but the best structure-based 
binding energy function accuracy is only $r^2_{cv}$=0.34 (FlexX [24], table 1). 



Table 1. Model accuracies from the GP dataset. 

Id.  Method                                                       Accuracy ($r^2_{cv}$) 
1    CoMFA                                                        0.46 
2    FlexX                                                        0.36 
3    ILP: Ligand only                                             0.66 
4    ILP: Ligand + water/3 + H-bonds involving ligand and water   0.74 



In the case of GP, ligands bind at the catalytic site buried deeply from 
the surface of the enzyme and they stabilize an inactive form of the protein 
mainly through specific hydrophilic interactions with the protein and some water 
molecules. Water molecules are well known to play a significant role in stabilizing 
protein-ligand complexes but they remain a challenge for many QSAR analyses 
as their mobility violates the common receptor assumption. Table 1 also shows 
a comparison between results where the background knowledge contains facts 
only related to the ligand and where the background knowledge also contains 
facts related to the water molecule position and all the possible hydrogen bonds 
between the ligand, the active site and the molecules of water. 

The results show that our ILP approach outperforms CoMFA and FlexX 
(P < 0.005 for both cases). Addition of more informative knowledge regarding 
the active site improves the predictive power of the model (P < 0.025). The 
results demonstrate the need to explicitly include hydrophilic interactions in 
forming a good predictor. The addition of the protein and water interaction 
also makes the interpretation of the model easier, as they highlight the most 
important features involved in the binding (see below). The resulting theory 
and QSAR model are reported in figure 3. The first three (pharmacophores) 
rules P1, P2 and P3 are overlaid with a highly active ligand to illustrate the 
main features found by the hypothesis on the same figure. Taking into account 
the relative homogeneity of the inhibitors, a close inspection of the rules found 
by Aleph in experiment 3 found that all the key chemical groups are involved in 
the final model. As shown in figure 3, ILP globally simplifies the interpretation. 
Insight into the binding mechanism is outlined in two points: 

— The amide group in the region 2 is a constant in the three rules (amide/3), 
acting, though, as the basis for the construction of the three pharmacophores. 
This is not surprising as this group is associated with the high activity of the 
series. Due to the high number of possible interactions in the regions 1 and 
3, the theory involves OH groups (alcohol/3, rules P2 and P3) rather than 
explicit hydrogen bonds. 

— The most surprising feature denoted by our method is related to the distal 
part (region 4) of the active site. Most rules involve either the position of 
two water molecules or an explicit hydrogen bond interaction with Arg292 
(water/3 and hb/4). How could these interactions be involved in the binding 
process? We found that [4] suggested that the presence of water overlapping 
this region could explain a high inhibitory effect with a strong stabilization 
of the enzyme in the 280’s loop. 

P1 : active(A) :- 
    hb(A,B,C,D), carbonyl(A,B,E), amide(A,B,F), dist(A,F,E,1.35,1.0), 
    dist(A,C,E,9.47,1.0), dist(A,D,E,10.77,1.0), dist(A,C,F,10.74,1.0), 
    dist(A,D,F,11.95,1.0). 

P2 : active(A) :- 
    water(A,B,C), alcohol(A,B,D), alcohol(A,B,E), amide(A,B,F), 
    dist(A,C,D,14.56,1.0), dist(A,C,E,13.12,1.0), dist(A,D,E,5.98,1.0), 
    dist(A,F,D,4.63,1.0), dist(A,F,E,3.00,1.0), dist(A,C,F,11.12,1.0). 

P3 : active(A) :- 
    water(A,B,C), water(A,B,D), alcohol(A,B,E), amide(A,B,F), 
    dist(A,C,D,4.83,1.0), dist(A,C,E,13.69,1.0), dist(A,D,E,14.38,1.0), 
    dist(A,F,E,4.80,1.0), dist(A,C,F,9.29,1.0), dist(A,D,F,9.75,1.0). 

P4 : active(A) :- 
    water(A,B,C), alcohol(A,B,D), methylen(A,B,E), equiv_ether(A,B,F), 
    dist(A,C,D,12.50,1.0), dist(A,E,D,4.42,1.0), dist(A,C,E,8.72,1.0), 
    dist(A,C,F,10.03,1.0), dist(A,D,F,3.01,1.0), dist(A,E,F,1.84,1.0). 

QSAR model : log(1/Ki) = 2.43 + 0.76*P1 + 0.91*P2 + 0.35*P3 - 0.49*P4 

Fig. 3. Theory from experiment 4, table 1 (top). 2D representation of the interaction 
involved in the binding of the ligand (numbered 26 in [23]) found by our ILP approach 
(bottom). Shaded circles/rectangles and open triangle outline the pharmacophore 
elements involved in the theory. Intermolecular interactions between the inhibitor and 
the binding site are represented with dashed lines. 
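
As a rough illustration of how the QSAR model of figure 3 could be evaluated for a new ligand, the sketch below treats each rule P1 to P4 as a 0/1 indicator of whether the corresponding pharmacophore covers the ligand. That encoding and the function name `predict_log_inv_ki` are assumptions made for this example; only the coefficients are taken from the figure.

```python
# Coefficients copied from the QSAR model in figure 3 (GP dataset).
INTERCEPT = 2.43
WEIGHTS = {"P1": 0.76, "P2": 0.91, "P3": 0.35, "P4": -0.49}


def predict_log_inv_ki(matched_rules):
    """Return the predicted log(1/Ki) for one ligand.

    matched_rules is the set of rule names whose pharmacophore covers the
    ligand. Assumption: each rule enters the regression as a 0/1 indicator.
    """
    return INTERCEPT + sum(w for name, w in WEIGHTS.items() if name in matched_rules)


if __name__ == "__main__":
    # A hypothetical ligand covered by rules P1 and P2 only.
    print(round(predict_log_inv_ki({"P1", "P2"}), 2))
```

Under this reading, the negative coefficient of P4 means that a ligand matching P4 is predicted to be less active than an otherwise identical ligand that does not match it.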



3.2 Human Immunodeficiency Virus Protease 

The second set concerns a series of inhibitors of the well-studied human 
immunodeficiency virus protease. In this case, we are dealing with a series of 
diverse ligands: some inhibitors are present in two conformations, and some 
residues in the protein may be mutated (i.e. the sequence of amino acids can 
differ from one structure to the next). The same process as for GP was applied, 
and the results are reported in table 2. 

Table 2. Model accuracies from the HIV-PR dataset. 

Id.  Method                                         Accuracy (r^„) 
1    CoMFA                                          0.58 
2    ChemScore                                      0.35 
3    ILP: Ligand only                               0.62 
4    ILP: Ligand + Active site                      0.75 
     + H-bonds involving the ligand + water/3 

In this case, the ILP structure-based model (r^„=0.75) improves on the 
CoMFA (r^„=0.58, P < 0.05) and the scoring function ChemScore [7] (r^„=0.35, 
P < 0.001) predictions of the binding energy. The theory from experiment 4 
(table 2) is reported in figure 4. The first three rules P1, P2 and P3 are mapped 
onto the most active inhibitor (PDB code: 1hvj). 

For HIV-PR, the structural requirements for highly active ligands can be seen 
from two points of view: 

— Polar interactions are highlighted by a specific hydrogen bond with Asp29 
(region 3) and the need for a group (alcohol/3 in P3) able to interact with 
Asp25 (region 1). This last amino acid is involved in the catalytic mechanism 
of HIV-PR [2]. Finally, the carbonyl group (carbonyl/3 in P3) in region 2 
interacts with the water molecule known to be crucial for the binding process. 

— Hydrophobic interactions are more difficult to include in the background 
knowledge as they are not as local as the hydrogen bonds, for example. 
Nevertheless, they are implicitly involved in the theory. P1 and P2 largely 
encode the relative orientation/position of four aromatic rings (mapped by 
lipo_seg/3 and six_ring/3). The hydrophobic behaviour (prot_lipo_seg/3) of 
the residues 81 and 84 (regions 4 and 5) is revealed to be important to 
ensure these non-polar contacts. 



4 Conclusions 

We have presented a new procedure for the formulation of accurate and easily 
interpretable QSARs to predict binding energy within a series of protein-ligand 
complexes. This extends the application of ILP in drug design to problems where 
the structure of the binding protein is known. To form the models we used a 
relational description of the molecular structure to find rules in the form of 
pharmacophores, and linear regression to combine the pharmacophores into a 
predictive model. We consider that the ILP approach was effective for the fol- 
lowing reasons: 

— the logical formalism is an effective representation for the diverse types of 
knowledge required. 

— the coordinates of molecular structures can be used directly without the 
superposition or prealignment required by some traditional approaches. 

— ILP deals naturally with the multiple instances problem and can find all 
possible pharmacophores consistent with the background. 

— the theories generated are compact and comprehensible in a language 
familiar to scientists. 

P1 : active(A) :- 
    hb(A,B,C,D), lipo_seg(A,B,E), six_ring(A,B,F), dist(A,C,E,5.49,1.0), 
    dist(A,C,F,5.31,1.0), dist(A,D,E,7.22,1.0), dist(A,D,F,7.77,1.0). 

P2 : active(A) :- 
    lipo_seg(A,B,C), prot_lipo_seg(A,B,84,D), six_ring(A,B,E), 
    dist(A,C,D,5.93,1.0), dist(A,C,E,4.88,1.0), dist(A,D,E,9.19,1.0). 

P3 : active(A) :- 
    alcohol(A,B,C), carbonyl(A,B,D), prot_lipo_seg(A,B,81,E), 
    dist(A,C,D,5.30,1.0), dist(A,C,E,11.54,1.0), dist(A,D,E,8.67,1.0). 

P4 : active(A) :- 
    carbonyl(A,B,C), pos_charge(A,B,D), prot_negcharge(A,B,29,E), 
    dist(A,C,D,9.23,1.0), dist(A,C,E,6.09,1.0), dist(A,D,E,9.61,1.0). 

QSAR model : log(1/Ki) = 8.00 + 0.81*P1 + 0.43*P2 + 0.58*P3 - 0.90*P4 

Fig. 4. Theory from experiment 4, table 2 (top). 2D representation of the interaction 
involved in the binding of 1hvj found by our ILP approach. The same notation as in 
figure 3 is adopted. 

We have tested this approach on two qualitatively different datasets. In both 
examples, the ILP models outperformed the results of traditional SBDD 
approaches while being of equal complexity. The ILP models were directly 
interpretable by mapping the learned pharmacophores onto selected examples, and 
these interpretations were consistent with previously reported analyses. The 
derivation of so-called receptor-based pharmacophores not only improves the 
predictive power of the models but also allows the identification of key 
interaction hotspots. In the case of GP, ILP has brought an unexpected insight 
into the binding mechanism. Analysis of the HIV-PR hypotheses shows that our 
approach can deal with a heterogeneous series of protein-ligand complexes. Here, 
we used direct information from the experimentally resolved structure of a 
similar protein-ligand complex to give clues as to whereabouts in the active 
site the ligand binds and in what conformation. Work is in progress to evaluate 
the applicability of our approach when such information is unavailable or 
insufficient. Flexible docking techniques can be used to explore the 
conformational space of the ligand within the active site, leading to a highly 
diverse docking solution set: either our ILP models can be used to restrict the 
search space [11] or pharmacophores can be learnt from the docking set. 
Abstract. Inducing classifiers that make accurate predictions on future 
data is a driving force for research in inductive learning. However, also 
of importance to the users is how to gain information from the models 
produced. Unfortunately, some of the most powerful inductive learning 
algorithms generate “black boxes” — that is, the representation of the 
model makes it virtually impossible to gain any insight into what has 
been learned. This paper presents a technique that can help the user 
understand why a classifier makes the predictions that it does by provid- 
ing a two-dimensional visualization of its class probability estimates. It 
requires the classifier to generate class probabilities but most practical 
algorithms are able to do so (or can be modified to this end). 



1 Introduction 

Visualization techniques are frequently used to analyze the input to a machine 
learning algorithm. This paper presents a generic method for visualizing the 
output of classification models that produce class probability estimates. The 
method has previously been investigated in conjunction with Bayesian network 
classifiers [5]. Here we provide details on how it can be applied to other types of 
classification models. 

There are two potential applications for this technique. First, it can help the 
user understand what kind of information an algorithm extracts from the input 
data. Methods that learn decision trees and sets of rules are popular because 
they represent the extracted information in intelligible form. This is not the case 
for many of the more powerful classification algorithms. Second, it can help ma- 
chine learning researchers understand the behavior of an algorithm by analyzing 
the output that it generates. Standard methods of assessing model quality — for 
example, receiver operating characteristic (ROC) curves [4] — provide informa- 
tion about a model’s predictive performance, but fail to provide any insight into 
why a classifier makes a particular prediction. 

Most existing methods for visualizing classification models are restricted to 
particular concept classes. Decision trees can be easily visualized, as can decision 
tables and naive Bayes classifiers [8]. In this paper we discuss a general 
visualization technique that can be applied in conjunction with any learning 
algorithm for classification models as long as these models estimate class 
probabilities. Most learning algorithms fall into this category¹. 

The underlying idea is very simple. Ideally we would like to know the class 
probability estimate for each class for every point in instance space. This would 
give us a complete picture of the information contained in a classification model. 
When there are two attributes it is possible to plot this information with arbi- 
trary precision for each class in a two-dimensional plot, where the color of the 
point encodes the class probability (e.g. black corresponds to probability zero 
and white corresponds to probability one). This can easily be extended to three 
dimensions. However, it is not possible to use this simple visualization technique 
if there are more than three attributes. 

This paper presents a data-driven approach for visualizing a classifier regard- 
less of the dimension of the instance space. This is accomplished by projecting 
its class probability estimates into two dimensions. It is inevitable that some 
information will be lost in this transformation but we believe that the resulting 
plotting technique is a useful tool for data analysts as well as machine learning 
researchers. The method is soundly based in probability theory and aside from 
the classifier itself only requires an estimate of the attributes’ joint density (e.g. 
provided by a kernel density estimator). 

The structure of the paper is as follows. Section 2 describes the visualization 
technique in more detail. Section 3 contains some experimental results. Section 4 
discusses related work, and Section 5 summarizes the paper. 



2 Visualizing Expected Class Probabilities 

The basic idea is to visualize the information contained in a classification model 
by plotting its class probability estimates as a function of two of the attributes in 
the data. The two attributes are user specified and make up the x and y axes of 
the visualization. In this paper we only consider domains where all the attributes 
are numeric. We discretize the two attributes so that the instance space is split 
into disjoint rectangular regions and each region corresponds to one pixel on the 
screen. The resulting rectangles are open-sided along all other attributes. Then 
we estimate the expected class probabilities in each region by sampling points 
from the region, obtaining class probability estimates for each point from the 
classification model, and averaging the results. The details of this method are 
explained below. 

The probability estimates for each region are color coded. We first assign a 
color to each class. Each of these colors corresponds to a particular combination 
of RGB values. Let $(r_k, g_k, b_k)$ be the RGB values for class $k$, i.e. if class 
$k$ gets probability one in a given region, this region is colored using those RGB 
values. If no class receives probability one, the resulting color is computed as 
a linear combination of all the classes’ RGB values. Let $\hat{e}_k$ be the 
estimated expected probability for class $k$ in a particular region. Then the 
resulting RGB values are computed as follows: 

$$r = \sum_k \hat{e}_k \times r_k \qquad g = \sum_k \hat{e}_k \times g_k \qquad b = \sum_k \hat{e}_k \times b_k$$ (1) 

This method smoothly interpolates between pure regions, i.e. regions where one 
class obtains all the probability mass. 

¹ Note that the technique can also be used in conjunction with clustering algorithms 
that produce cluster membership probabilities. 

The class colors can be chosen by the user based on a standard color chooser 
dialog. For example, when there are only two classes, the user might choose 
black for one class, and white for the other, resulting in a grey-scale image. Note 
that the colors will not necessarily uniquely identify a probability vector. In a 
four-class problem setting the corresponding colors to (1, 0, 0), (0, 1, 0), (0, 0, 1), 
and (0,0,0) will result in a one-to-one mapping. However, with other color set- 
tings and/or more classes there may be clashes. To alleviate this problem our 
implementation shows the user the probability vector corresponding to a certain 
pixel on mouse over. It also allows the user to change the colors at any time, so 
that the situation in ambiguous regions can be clarified. 
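
The color computation of Equation 1 can be sketched in a few lines. The function name `mix_color` and the convention that RGB components lie between zero and one are assumptions of this example, not details of the implementation described in the text.

```python
def mix_color(probs, class_colors):
    """Blend the class colors by their expected probabilities (Equation 1).

    probs        -- expected class probabilities, one per class
    class_colors -- one (r, g, b) tuple per class, components in [0, 1]
    """
    r = sum(e * c[0] for e, c in zip(probs, class_colors))
    g = sum(e * c[1] for e, c in zip(probs, class_colors))
    b = sum(e * c[2] for e, c in zip(probs, class_colors))
    return (r, g, b)


# Two classes colored black and white give a grey-scale value.
grey = mix_color([0.25, 0.75], [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)])

# Four classes colored red, green, blue and black: the three channels hold
# the first three probabilities, and the fourth follows because they sum to one.
four_colors = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)]
rgb = mix_color([0.1, 0.2, 0.3, 0.4], four_colors)
recovered = list(rgb) + [1.0 - sum(rgb)]
```

The second call illustrates why the four-class color setting mentioned above yields a one-to-one mapping: the probability vector can be read back from the pixel color.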

We now discuss how we estimate the expected class probabilities for each 
pixel (i.e. each rectangular region in instance space). If the region is small enough 
we can assume that the density is approximately uniform within the region. In 
this case we can simply sample points uniformly from the region, obtain class 
probability estimates for each point from the model, and average the results. 
However, if the uniformity assumption does not hold we need an estimate $f$ of the 
density function (for example, provided by a kernel density estimator [2]) and 
sample or weight points according to this estimate. Using the density is crucial 
when the method is applied to instance spaces with more than two dimensions 
(i.e. two predictor attributes) because then the uniformity assumption is usually 
severely violated. 

Given a kernel density estimator $f$ we can estimate the expected class 
probabilities by sampling instances from a region using a uniform distribution and 
weighting their predicted class probabilities $p_k$ according to $f$. Let 
$S = (x_1, \ldots, x_l)$ be our set of $l$ uniformly distributed samples from a 
region. Then we can estimate the expected class probability $\hat{e}_k$ of class 
$k$ for that region as follows: 

$$\hat{e}_k = \frac{\sum_{x \in S} f(x)\, p_k(x)}{\sum_{x \in S} f(x)}$$ (2) 



If there are only two dimensions this method is quite efficient. The number 
of samples required for an accurate estimate could be determined automatically 
by computing a confidence interval for it, but the particular choice of $l$ is not 
critical if the screen resolution is high enough (considering the limited resolution 
of the color space and the sensitivity of the human eye to local changes in color). 
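
A minimal sketch of the estimate in Equation 2 for a two-attribute instance space follows. The names `expected_class_prob`, `density` and `class_prob` are hypothetical; the density and the classifier are passed in as plain Python callables so that the example stays self-contained.

```python
import random


def expected_class_prob(region, density, class_prob, n_samples=100, rng=None):
    """Density-weighted average of class probabilities over one region.

    region     -- ((x_lo, x_hi), (y_lo, y_hi)), the rectangle behind a pixel
    density    -- callable returning the density estimate f(x) for a point x
    class_prob -- callable returning p_k(x) for the class of interest
    """
    rng = rng or random.Random(0)
    (x_lo, x_hi), (y_lo, y_hi) = region
    num = den = 0.0
    for _ in range(n_samples):
        # Uniform sample from the rectangle, weighted by the density.
        x = (rng.uniform(x_lo, x_hi), rng.uniform(y_lo, y_hi))
        w = density(x)
        num += w * class_prob(x)
        den += w
    return num / den if den > 0 else float("nan")
```

With a constant density this reduces to the plain average used under the uniformity assumption.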

Unfortunately this estimation procedure becomes very inefficient in higher- 
dimensional instance spaces because most of the instances in S will receive a very 
low value from the density function: most of the density will be concentrated 
in specific regions of the space. Obtaining an accurate estimate would require a 
very large number of samples. However, it turns out that there is a more efficient 
sampling strategy for estimating $\hat{e}_k$. This strategy is based on the kernel 
density estimator that we use to represent $f$. 

A kernel density estimator combines local estimates of the density based on 
each instance in a dataset. Assuming that there are $n$ instances it consists of 
$n$ kernel functions: 

$$f(x) = \frac{1}{n} \sum_{i=1}^{n} k_i(x)$$ (3) 

where $k_i$ is the kernel function based on instance $x_i$: 

$$k_i(x) = \prod_{j=1}^{m} k_{ij}(x_j)$$ (4) 

This is a product of $m$ component functions, one for each dimension. We use a 
Gaussian kernel, for which the component functions are defined as: 

$$k_{ij}(x_j) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\left(-\frac{(x_j - x_{ij})^2}{2\sigma_{ij}^2}\right)$$ (5) 

Each $k_{ij}$ is the density of a normal distribution centered on attribute value 
$j$ of instance $x_i$. The parameter $\sigma_{ij}$ determines the width of the 
kernel along dimension $j$. In our implementation we use 
$\sigma_{ij} = (max_j - min_j) \times d_i$, where $max_j$ and $min_j$ are the 
maximum and minimum value of attribute $j$, and $d_i$ is the Euclidean distance to 
the $k$-th neighbor of $x_i$ after all the attributes’ values have been normalized 
to lie between zero and one. The value of the parameter $k$ is user specified. 
Alternatively it could be determined by maximizing the cross-validated likelihood 
of the data [7]. 
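
The kernel density estimator described above can be sketched as follows, using only the Python standard library. The function names are hypothetical, and the guard for constant attributes is an addition of this example that the text does not discuss. The sketch also assumes that the dataset has more than k instances and no exact duplicates, since a zero neighbor distance would give a zero kernel width.

```python
import math


def kernel_widths(data, k=3):
    """Return sigma[i][j] = (max_j - min_j) * d_i for every instance i.

    d_i is the Euclidean distance from instance i to its k-th nearest
    neighbor after all attributes have been scaled to [0, 1].
    """
    m = len(data[0])
    lo = [min(row[j] for row in data) for j in range(m)]
    hi = [max(row[j] for row in data) for j in range(m)]
    # Guard (not from the text): a constant attribute gets range 1.
    span = [(hi[j] - lo[j]) or 1.0 for j in range(m)]
    norm = [[(row[j] - lo[j]) / span[j] for j in range(m)] for row in data]
    sigmas = []
    for i, xi in enumerate(norm):
        dists = sorted(math.dist(xi, xo) for o, xo in enumerate(norm) if o != i)
        d_i = dists[k - 1]
        sigmas.append([span[j] * d_i for j in range(m)])
    return sigmas


def gaussian(x, mean, sigma):
    """Component function of Equation 5: a normal density."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def density(x, data, sigmas):
    """Equations 3 and 4: the average over instances of the product kernel."""
    total = 0.0
    for xi, si in zip(data, sigmas):
        prod = 1.0
        for j in range(len(x)):
            prod *= gaussian(x[j], xi[j], si[j])
        total += prod
    return total / len(data)
```

A typical call would first compute `sigmas = kernel_widths(data, k=3)` once and then evaluate `density(x, data, sigmas)` for each point of interest.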

Based on the kernel density estimator we can devise a sampling strategy 
that produces a set of instances $Q$ by sampling a fixed number of instances from 
each kernel function. This can be done by sampling from the kernel’s normal 
distributions to obtain the attribute values for each instance. The result is that 
the instances in $Q$ are likely to be in the populated areas of the instance space. 
Given $Q$ we can estimate the expected class probability for a region $R$ as follows: 

$$\hat{e}_k = \frac{1}{|x \in R \wedge x \in Q|} \sum_{x \in R \wedge x \in Q} p_k(x)$$ (6) 
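
The sampling strategy behind Equation 6 can be illustrated with the following sketch. The helper names and the use of `None` for regions without samples are assumptions of this example.

```python
import random


def sample_from_kernels(data, sigmas, per_kernel, rng):
    """Build the set Q by drawing per_kernel instances from every kernel."""
    q = []
    for xi, si in zip(data, sigmas):
        for _ in range(per_kernel):
            # Each attribute is drawn from the kernel's normal distribution.
            q.append(tuple(rng.gauss(mu, s) for mu, s in zip(xi, si)))
    return q


def region_estimate(q, in_region, class_prob):
    """Equation 6: average p_k over the members of Q that fall inside R.

    Returns None when no sample lands in the region.
    """
    hits = [x for x in q if in_region(x)]
    if not hits:
        return None
    return sum(class_prob(x) for x in hits) / len(hits)
```

The `None` case is exactly the situation in which a region receives no samples from the populated parts of the instance space.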



Unfortunately this is not the ideal solution: for our visualization we want 
accurate estimates for every pixel, not only the ones corresponding to populated 
parts of the instance space. Most regions R will not receive any samples. The 
solution is to split the set of attributes into two subsets: the first set containing 
the two attributes our visualization is based on, and the second set containing 
the remainder. Then we can fix the values of the attributes in the first set so 
that we are guaranteed to get an instance in the area that we are interested in 
(corresponding to the current pixel), sample values for the other attributes from 
a kernel, and use the fixed attributes to weight the resulting instance according 
to the density function. 

Let us make this more precise by assuming that our visualization is based 
on the first two attributes in the dataset. For these two attributes we fix values 
$x_1$ and $x_2$ in the region corresponding to the pixel that we want to plot. Then 
we obtain an instance $x_i$ from kernel $k_i$ by sampling from the kernel’s normal 
distributions to obtain attribute values $x_{i3}, \ldots, x_{im}$, and setting 
$x_{i1} = x_1$ and $x_{i2} = x_2$. We can then estimate the class probability 
$p_k(x_1, x_2)$ for location $x_1, x_2$ as follows: 

$$p_k(x_1, x_2) = \frac{\sum_{i=1}^{n} p_k(x_i)\, k_{i1}(x_1)\, k_{i2}(x_2)}{\sum_{i=1}^{n} k_{i1}(x_1)\, k_{i2}(x_2)}$$ (7) 



This is essentially the likelihood weighting method used to perform probabilistic 
inference in Bayesian networks [6]. The $p_k(x_i)$ are weighted by the 
$k_{i1}(x_1) k_{i2}(x_2)$ to take the effect of the kernel on the two fixed 
dimensions into account. The result of this process is that we have marginalized 
out all dimensions apart from the two that we are interested in. 
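
The likelihood-weighted estimate of Equation 7 can be sketched as below for one sample per kernel. The function `class_prob_at` and its argument names are hypothetical; `class_prob` stands for any classifier that returns the probability of the class of interest for a complete instance.

```python
import math
import random


def gaussian(x, mean, sigma):
    """Normal density used as the kernel component function."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def class_prob_at(x1, x2, data, sigmas, class_prob, rng):
    """Equation 7 with one sampled instance per kernel.

    The first two attributes are fixed at (x1, x2); the remaining ones are
    drawn from each kernel's normal distributions.
    """
    num = den = 0.0
    for xi, si in zip(data, sigmas):
        rest = [rng.gauss(mu, s) for mu, s in zip(xi[2:], si[2:])]
        instance = (x1, x2, *rest)
        # Weight: how well this kernel explains the two fixed values.
        w = gaussian(x1, xi[0], si[0]) * gaussian(x2, xi[1], si[1])
        num += class_prob(instance) * w
        den += w
    return num / den if den > 0 else float("nan")
```

Kernels centered far from the fixed location receive weights close to zero, so their sampled instances contribute little to the estimate.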

One sample per kernel is usually not sufficient to obtain an accurate 
representation of the density and thus an accurate estimate of $p_k(x_1, x_2)$, 
especially in higher-dimensional spaces. In our implementation we repeat the 
sampling process $r^{m-2}$ times, where $r$ is a user-specified parameter, evaluate 
Equation 7 for each resulting set of instances, and take the overall average as an 
estimate of $p_k(x_1, x_2)$. A more sophisticated approach would be to compute a 
confidence interval for the estimated probability and to stop the sampling process 
when a certain precision has been attained. 

Note that the running time can be decreased by first sorting the kernels according 
to their weights $k_{i1}(x_1) k_{i2}(x_2)$ and then sampling from the corresponding 
kernels in decreasing order until the cumulative weight exceeds a certain percentage 
of the total weight (e.g. 99%). Usually only a small fraction of the kernels need 
to be sampled from as a result of this filtering process. 
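
The filtering step can be sketched as a small selection routine. The name `select_kernels` and the default `coverage` of 0.99 (mirroring the 99% example in the text) are choices made for this illustration.

```python
def select_kernels(weights, coverage=0.99):
    """Return the indices of the heaviest kernels, in decreasing weight order,
    stopping once their cumulative weight exceeds coverage * total weight."""
    total = sum(weights)
    if total <= 0:
        return []
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += weights[i]
        if cumulative > coverage * total:
            break
    return chosen
```

Only the kernels whose indices are returned would then be sampled from when evaluating Equation 7.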

To obtain the expected class probability $\hat{e}_k$ for a region corresponding to 
a particular pixel we need to repeat this estimation process for different 
locations $x_{l1}, x_{l2}$ within the pixel and compute a weighted average of the 
resulting probability estimates based on the density function: 

$$\hat{e}_k = \frac{\sum_{l} f(x_{l1}, x_{l2})\, p_k(x_{l1}, x_{l2})}{\sum_{l} f(x_{l1}, x_{l2})}$$ (8) 

where 

$$f(x_{l1}, x_{l2}) = \frac{1}{n} \sum_{i=1}^{n} k_{i1}(x_{l1})\, k_{i2}(x_{l2})$$ (9) 

This weighted average is then plugged into Equation 1 to compute the RGB 
values for the pixel. 
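
Equations 8 and 9 can be sketched as follows. The function names are hypothetical; `prob_at` stands for any routine that returns the estimate of Equation 7 for a given location.

```python
import math


def gaussian(x, mean, sigma):
    """Normal density used as the kernel component function."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)


def marginal_density(a, b, data, sigmas):
    """Equation 9: density of location (a, b) on the two visualized attributes."""
    total = 0.0
    for xi, si in zip(data, sigmas):
        total += gaussian(a, xi[0], si[0]) * gaussian(b, xi[1], si[1])
    return total / len(data)


def pixel_estimate(locations, density2d, prob_at):
    """Equation 8: density-weighted average of p_k over locations in a pixel.

    locations -- iterable of (a, b) points inside the pixel's rectangle
    density2d -- callable (a, b) -> f(a, b), e.g. built from marginal_density
    prob_at   -- callable (a, b) -> p_k(a, b), e.g. the estimate of Equation 7
    """
    num = den = 0.0
    for a, b in locations:
        w = density2d(a, b)
        num += w * prob_at(a, b)
        den += w
    return num / den if den > 0 else float("nan")
```

Applying `pixel_estimate` once per class gives the probability vector that Equation 1 turns into the pixel color.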
3 Some Example Visualizations 



In the following we visualize class probability estimators on three example do- 
mains. We restrict ourselves to two-class domains so that all probability vectors 
can be represented by shades of grey. Some color visualizations are available 
online at http://www.cs.waikato.ac.nz/ml/weka/bvis. Note that our 
implementation is included in the Java package weka.gui.boundaryvisualizer as 
part of the Weka machine learning workbench [9] Version 3.3.6 and later². 

For each result we used two locations per pixel to compute the expected 
probability, set the parameter $r$ to two (i.e. generating four samples per kernel 
for a problem with four attributes), and used the third neighbor ($k = 3$) for 
computing the kernel width. 

The first domain is an artificial domain with four numeric attributes. Two 
of the attributes are relevant ($x_1$ and $x_2$) and the remaining ones ($x_3$ and 
$x_4$) are irrelevant. The attribute values were generated by sampling from normal 
distributions with unit variance. For the irrelevant attributes the distributions 
were centered at zero for both classes. For the relevant ones they were centered 
at −1 for class one and +1 for class two. We generated a dataset containing 100 
instances from each class. 
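
A dataset of this kind can be generated with a few lines of Python. The function name, the tuple layout and the fixed seed are choices of this sketch, so the resulting values will not match those used in the experiments.

```python
import random


def make_artificial_dataset(n_per_class=100, seed=0):
    """Generate rows (x1, x2, x3, x4, label) as described in the text.

    x1 and x2 are relevant: normal with unit variance, centered at -1 for
    class one and at +1 for class two. x3 and x4 are irrelevant: standard
    normal for both classes.
    """
    rng = random.Random(seed)
    rows = []
    for label, centre in (("one", -1.0), ("two", 1.0)):
        for _ in range(n_per_class):
            x1 = rng.gauss(centre, 1.0)
            x2 = rng.gauss(centre, 1.0)
            x3 = rng.gauss(0.0, 1.0)
            x4 = rng.gauss(0.0, 1.0)
            rows.append((x1, x2, x3, x4, label))
    return rows
```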

We first built a logistic regression model from this data. The resulting model 
is shown in Figure 1. Note that the two irrelevant attributes have fairly small co- 
efficients, as expected. Figure 2 shows the results of the visualization procedure 
for three different pairs of attributes based on this model. The points super- 
imposed on the plot correspond to the actual attribute values of the training 
instances in the two dimensions visualized. The color (black or white) of each 
point indicates the class value of the corresponding instance. 

Figure 2a is based on the two relevant attributes ($x_1$ on the x axis and 
$x_2$ on the y axis). The linear class boundary is clearly defined because the two 
visualization attributes are the only relevant attributes in the dataset. The lower 
triangle represents class one and the upper triangle class two. Figure 2b shows 
the result for $x_1$ on the x axis, and $x_3$ on the y axis. It demonstrates 
visually that $x_1$ is relevant while $x_3$ is not. Figure 2c displays a 
visualization based on the two irrelevant attributes. It shows no apparent 
structure, as expected for two completely irrelevant attributes. 

Figure 4 shows visualizations based on the same pairs of attributes for the 
decision tree from Figure 3. The tree is based exclusively on the two relevant 
attributes, and this fact is reflected in Figure 4a: the area is divided into rect- 
angular regions that are uniformly colored (because the probability vectors are 
constant within each region). Note that the black region corresponds to three 
separate leaves and that one of them is not pure. The difference in “blackness” is 
not discernible. 

Figure 4b shows the situation for attributes $x_1$ (relevant) and $x_3$ (irrelevant). 
Attribute $x_1$ is used twice in the tree, resulting in three distinct bands. Note that 
the three bands are (almost) uniformly colored, indicating that the attribute on 
the y axis ($x_3$) is irrelevant. 

² Available from http://www.cs.waikato.ac.nz/ml/weka. 

$$p(\text{class} = \text{one} \mid x) = \frac{1}{1 + e^{\,?.77x_1 + 4.21x_2 - 0.15x_3 - 0.14x_4 - 1.0?}}$$ 

Fig. 1. The logistic regression model for the artificial dataset. 

Fig. 2. Visualizing logistic regression for the artificial data using (a) the two relevant 
attributes, (b) one relevant and one irrelevant attribute, and (c) the two irrelevant 
attributes. 

Figure 4c is based on the two irrelevant attributes. The visualization shows 
no structure and is nearly identical to the corresponding result for the logistic 
regression model shown in Figure 2c. Minor differences in shading compared to 
Figure 2c are due to differences in the class probability estimates that are caused 
by the two relevant attributes (i.e. a result of the differences between Figures 4a 
and 2a). 

For illustrative purposes Figure 6 shows a visualization for a two-class version 
of the iris data (using the 100 instances pertaining to classes iris-virginica 
and iris-versicolor) based on the decision tree in Figure 5. The iris data 
can be obtained from the UCI repository [1]. In Figure 6a the petallength 
attribute is shown on the x axis and the petalwidth attribute on the y axis. 
There are four uniformly colored regions corresponding to the four leaves of the 
tree. In Figure 6b petallength is on the x axis and sepallength on the y 
axis. The influence of sepallength is clearly visible in the white area despite 
this attribute not being used in the tree. Figure 6c is based on sepallength 
(x) and sepalwidth (y). Although these attributes are not used in the tree 
the visualization shows a clear trend going from the lower left to the upper 
right, and a good correlation of the probability estimates with the actual class 
values of the training instances. This is a result of correlations that exist between 
sepallength and sepalwidth and the two attributes used in the tree. 

Fig. 3. The decision tree for the artificial dataset. 

Fig. 4. Visualizing the decision tree for the artificial data using (a) the two relevant 
attributes, (b) one relevant and one irrelevant attribute, and (c) the two irrelevant 
attributes. 
Fig. 5. The decision tree for the two-class iris dataset. 




Fig. 6. Visualizing the decision tree for the two-class iris data using (a) petallength 
and petalwidth, (b) petallength and sepallength, and (c) sepallength and 
sepalwidth (with the first attribute on the x axis and the second one on the y axis). 



This particular example shows that the pixel-based visualization technique 
can provide additional information about the structure in the data even when 
used in conjunction with an interpretable model like a decision tree. In this 
case it shows that the decision tree implicitly contains much of the information 
provided by the sepallength and sepalwidth attributes (although they are not 
explicitly represented in the classifier). 

To provide a more realistic example Figure 8 shows four visualizations for 
pairs of attributes from the pima-indians diabetes dataset [1]. This dataset 
has eight attributes and 768 instances (500 belonging to class tested_negative 
and 268 to class tested_positive). The decision tree for this problem is shown 
in Figure 7. 

Fig. 7. The decision tree for the diabetes dataset. 

Fig. 8. Visualizing the decision tree for the diabetes data using (a) plas and mass, (b) 
preg and pedi, (c) pres and age, and (d) skin and insu (with the first attribute on 
the x axis and the second one on the y axis). 

Figure 8a is based on the plas (x axis) and mass (y axis) attributes, which 
are tested at the root and the first level of the decision tree respectively (as 
well as elsewhere in the tree). This makes them likely to be the most predictive 
attributes in the data. There are eight nodes in the tree where either one of 
these attributes is tested. This results in nine distinct rectangular regions, of 
which eight are visible in the visualization. The missing region is a result of the 
last split on the mass attribute at the bottom of the left subtree and hidden by 
the points in the middle of the plot. There are two regions where the classifier 
is certain about the class membership: a white region in the lower left corner, 
and a black region in the upper right one. These correspond to the left-most 
and right-most paths in the tree respectively, where only the two visualization 
attributes are tested. All other regions involve tests on other attributes and are 
therefore associated with greater uncertainty in the class membership. 

Figure 8b is based on preg (x) and pedi (y). They are tested at three nodes 
in the tree — all below level four — resulting in four rectangular regions. Only 
three of these are discernible in the plot; the fourth one corresponds to the split 
on pedi at the bottom of the right subtree (and is very faintly visible on screen 
but not in the printed version). 

In Figure 8c the visualization is based on pres (x) and age (y), tested eight 
times in total. Note that some of the rectangular regions are the consequence 
of overlapping regions originating from splits in different subtrees because the 
subtrees arise by partitioning on non-visualized attributes. 

Figure 8d visualizes the model using the only two attributes that do not 
occur in the decision tree: skin (x) and insu (y). Again, like in the iris data 
(Figure 6b), there is some correlation between the actual class labels and the 
attributes not explicitly taken into account by the classifier. However, in this 
case the correlation is very weak. 

4 Related Work 

There appears to have been relatively little attention devoted to general tech- 
niques for the visualization of machine learning models. Methods for particular 
types of classifiers have been developed, for example, for decision trees, decision 
tables, and naive Bayesian classifiers [8], but in these cases the visualization 
procedure follows naturally from the model structure. 

The structure of Bayesian networks can be visualized as a directed graph. 
However, the graph-based visualization is limited because it does not provide any 
visualization of the probability estimates generated by a network. Rheingans and 
desJardins [5] apply the basic pixel-based visualization technique discussed in 
this paper to visualize these estimates. They indicate that the technique can be 
used in conjunction with other types of class probability estimators but do not 
provide details on how this can be done. Inference methods for Bayesian networks 
directly provide estimates of conditional probabilities based on evidence variables 
(in this case the two attributes used in the visualization), and this means a 
separate density estimator and sampling strategy is not required. Rheingans 
and desJardins also investigate a visualization technique that maps the instance 
space into two dimensions using a self-organizing map (SOM) [3]. However, this 
makes it difficult to relate a pixel in the visualization to a point in the original 
instance space. 

5 Conclusions 

This paper has presented a generic visualization technique for class probability 
estimators. The basic method is not new and has been investigated in the context 
of Bayesian network classifiers before. Our contribution is that we have provided 
details on how to generalize it to arbitrary classification models that produce 
class probability estimates. We have provided some example visualizations based 
on logistic regression and decision trees that demonstrate the usefulness of this 
method as a general tool for analyzing the output of learning algorithms. Potential 
applications are twofold: practitioners can use this tool to gain insight into 
the data even if a learning scheme does not provide an interpretable model, and 
machine learning researchers can use it to explore the behavior of a particular 
learning technique. 



Acknowledgments 

Many thanks to Len Trigg for pointing out that the method is applicable to prob- 
abilistic clustering algorithms, and to Geoff Holmes and Bernhard Pfahringer for 
their comments. This research was supported by Marsden Grant 01-UOW-019. 



References 

1. C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998. 
[www.ics.uci.edu/~mlearn/MLRepository.html]. 

2. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical 
Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001. 

3. T. Kohonen. Self-Organizing Maps. Springer-Verlag, 1997. 

4. Foster J. Provost and Tom Fawcett. Analysis and visualization of classifier 
performance: Comparison under imprecise class and cost distributions. Knowledge 
Discovery and Data Mining, pages 43-48, 1997. 

5. Penny Rheingans and Marie desJardins. Visualizing high-dimensional predictive 
model quality. In Proceedings of IEEE Visualization 2000, pages 493-496, 2000. 

6. Stuart Russell and Peter Norvig. Artificial Intelligence. Prentice-Hall, 1995. 

7. P. Smyth. Model selection for probabilistic clustering using cross-validated likeli- 
hood. Statistics and Computing, pages 63-72, 2000. 

8. Kurt Thearling, Barry Becker, Dennis DeCoste, Bill Mawby, Michel Pilote, and Dan 
Sommerfield. Information Visualization in Data Mining and Knowledge Discovery, 
chapter Visualizing Data Mining Models. Morgan Kaufmann, 2001. 

9. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and 
Techniques with Java Implementations. Morgan Kaufmann, 2000. 




Automated Detection of Epidemics from the 
Usage Logs of a Physicians’ Reference Database 



Jaana Heino¹ and Hannu Toivonen² 

¹ National Public Health Institute, Helsinki, Finland 
Department of Computer Science, University of Helsinki 
jaana.heino@cs.helsinki.fi 
² Department of Computer Science, University of Helsinki 
hannu.toivonen@cs.helsinki.fi 



Abstract. Epidemics of infectious diseases are usually recognized by an 
observation of an abnormal cluster of cases. Usually, the recognition is 
not automated, and relies on the alertness of human health care workers. 
This can lead to significant delays in detection, since real-time data from 
physicians’ offices is not available. However, in Finland a Web-based 
collection of guidelines for primary care exists, and increases in queries 
concerning certain diseases have been shown to correlate with epidemics. 
We introduce a simple method for automated online mining of probable 
epidemics from the log of this database. The method is based on deriving 
a smoothed time series from the data, on using a flexible selection of data 
for comparison, and on applying randomization statistics to estimate 
the significance of findings. Experimental results on simulated and real 
data show that the method can provide accurate and early detection of 
epidemics. 



1 Introduction 

The usual way of recognizing an infectious disease epidemic is through an obser- 
vation of clustering (temporal, geographical, or both) of new cases by someone 
involved either with the diagnosis of patients or a disease registry. When the 
epidemic is widely spread - either spatially, temporally, or both - it might be 
very difficult or outright impossible for a single individual in the field to notice 
the change, and registries only receive notification after the patient has met 
a physician, laboratory samples have been analyzed, and reports filed, which 
causes delay. 

In certain diseases, however, early detection would be very desirable. This 
is the case in, for instance, rare diseases whose etiology is not clear, in order 
to begin epidemiological studies as early as possible. Another example of the 
benefits of early detection comes from the detection of food-borne epidemics, 
where control measures are often much easier to conduct in the early phases of 
the epidemic. 

If the registries could utilize diagnostic hypotheses made by physicians as 
soon as the patient has met the doctor, the delay would be eliminated. 
Unfortunately, such data is not available. However, in Finland a database exists that 
might be a way around this problem. The Physician’s Reference Database [1] is a 
collection of medical guidelines for primary care, from which physicians often 
seek information about infectious diseases [2]. According to a preliminary study, 
an increase in the rate of database searches about a certain disease sometimes 
correlates with the onset of an epidemic [3]. 

In this work we develop a simple method for automatic detection and evalu- 
ation of such increases. 



What Is an Epidemic? Incidence, in epidemiology, is defined as the number of 
new cases per unit of time. Prevalence refers to the number of people with 
a certain characteristic (e.g. a certain disease) in the population at any given 
moment; disease prevalence is thus dependent on the incidence of the disease 
and the duration of the disease. 

Incidence and prevalence are well-defined concepts. An epidemic, on the other 
hand, is more difficult to define precisely. In mathematical modeling of infectious 
diseases (e.g. [4]) an epidemic is said to occur if the introduction of an infectious 
agent to a population causes people (other than the first introducer) to get sick 
through people transmitting the disease to each other. 

This definition is, however, useful only in theoretical modeling, or when the 
actual incidence in the absence of an epidemic is zero. Many diseases have a 
certain baseline incidence: a steady incidence and prevalence are considered 
the normal situation, even if the transfer happens person-to-person. Often, we 
also call a cluster of cases an "epidemic" even though no actual person-to-person 
transmission happens. What "an epidemic" is thus depends on the observer’s 
subjective goals and estimates of local conditions. 

Many infectious diseases have a seasonal cycle: the incidence of the 
disease increases and decreases with a certain, steady interval. Some diseases, 
like influenza, have yearly peaking epidemics. Some have simply a slightly higher 
incidence during a certain time of year: for instance, food- and water-borne 
diseases are more common in warm weather. Some diseases have a longer cycle: 
for example, a major Pogosta disease epidemic occurs about every seven years 
in Finland. 

A generic goal of automated detection of epidemics is to discover the begin- 
ning of any epidemic, whether cyclic or not. For cyclic diseases, an alternative 
goal is to consider the cyclic variation normal and to detect incidences that are 
exceptionally high given the normal cycle. 



Requirements for Online Surveillance of Epidemics. It is of course important 
that a detection system achieves high sensitivity (proportion of epidemics 
detected), so that epidemics are not overlooked. It is also important that epidemics 
are detected as soon as possible after their onset. However, false alarms 
severely undermine the credibility of the warning system, which might result in the 
users no longer taking the warnings seriously. Thus, high specificity (proportion 
of non-epidemic periods classified as non-epidemic), leading to a higher 
positive prediction value (probability of epidemic given that the system outputs an 
alarm), should be a priority even at the cost of some reduction in the detection 
speed. 

With a good system, the user can specify the period to which the present 
moment is compared: for instance, it should be possible to ask "is this week 
different from the previous n months", "is this month different from the same 
month during previous years", and several other questions like that, with the 
same method. 

The method we introduce allows such flexibility and different treatments of 
cyclic diseases. We will analyze the sensitivity, specificity and positive prediction 
value of the method on both synthetic and real data. 



2 Physician’s Reference Database 
and Data Preprocessing 



The Physician’s Reference Database is a web-based collection of medical guide- 
lines used by physicians. The database consists of thousands of articles, each 
described by a number of keywords (typically names of diseases, symptoms and 
findings). Reading events are recorded in a usage log, allowing one to mine for 
physicians’ active interests in different diseases. 

Let $A$ be the set of all articles in the database, and $i = 1, 2, \ldots, n$ the sequence 
of days under surveillance. The raw data consists of the number of reading events 
in a day, $D(a)_i$, for each article $a \in A$. Each of the articles has associated 
keywords. Let $A(k)$ be the set of articles that contain disease $k$ among their keywords. 
For each day $i$ and each keyword $k$, the daily total count of events of all relevant 
articles is 

$$D(k)_i = \sum_{a \in A(k)} D(a)_i.$$

Usage data from the reference database is available from October 1st 2000 on- 
ward; in the analysis in this work we use data until September 30th 2002. On 
average, there were 1465 reading events (a user viewing an article) per day, by 
all users total. The event counts show a notable upward trend: during the first 
100 days the average was 633.9 events and during the 100 last days 2696.1. 

As trends related to the changing usage of the database are not relevant, it 
is necessary to "normalize" daily counts in relation to the overall database 
usage. We divide the daily event count per keyword by the total event count, 
giving the basic unit of our data, the proportional daily event count: 

$$d(k)_i = \frac{D(k)_i}{\sum_{a \in A} D(a)_i}.$$
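The two preprocessing steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the input layout (one dictionary of per-article counts per day, plus a keyword set per article) and the function name are hypothetical, and a day with no reading events at all is assumed to contribute a proportion of zero.

```python
from collections import defaultdict


def proportional_daily_counts(article_counts, article_keywords):
    """Compute d(k)_i for every keyword k and day i.

    article_counts: list over days; each item maps article id -> D(a)_i.
    article_keywords: maps article id -> set of keywords of that article.
    (Both layouts are assumptions made for this sketch.)
    """
    keywords = set()
    for kws in article_keywords.values():
        keywords.update(kws)

    result = {k: [] for k in keywords}
    for day in article_counts:
        total = sum(day.values())  # all reading events on this day
        per_keyword = defaultdict(int)
        for article, count in day.items():
            for k in article_keywords.get(article, ()):
                per_keyword[k] += count  # D(k)_i: sum over articles in A(k)
        for k in keywords:
            # Assumption: a day without any events yields proportion 0.0.
            result[k].append(per_keyword[k] / total if total else 0.0)
    return result
```

Note that an article carrying several keywords is counted once for each of them, exactly as in the sum over $A(k)$.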



There are some potential problems with this normalization approach: something 
that is likely to affect the keyword-specific event counts is also likely to affect the 
total count. This might cause artefacts, i.e. trends or peaks that are not present 
in the original data, or "dilution" of a smaller epidemic by a bigger one. At the 
moment we do not try to counteract such possible side effects. 

3 The Method 

The goal is to detect possible epidemics, observable as exceptionally high pro- 
portional daily event counts, and to output an alarm as soon as possible after the 
beginning of an epidemic. We develop a simple randomization-based framework 
to recognize significant increases in event counts. 

The series is first smoothed to remove some of the daily variations while 
retaining most of the trends. Two smoothing methods, namely sliding average 
and sliding linear regression, are used (see below). 

A null hypothesis period, a sequence of days from the past to which the 
present moment is compared, is chosen. The way the null hypothesis period is 
designated determines the exact question we are trying to answer. When the 
present moment is compared to all past non-epidemic times, the question is 
"is there an epidemic now". Other examples include the last n days ("has the 
situation changed for the worse recently"), the same months during previous 
years ("is this June different from what is typical for previous Junes"), and all 
previous epidemic times ("is there an epidemic now that is worse than typical 
epidemics of this disease"). 

We assume that in the absence of an epidemic the proportional daily event 
counts for a given disease are independent and identically distributed, but note 
that the independence assumption is most likely not completely true. A person 
that has read a lot about some disease recently is less likely to review that 
information than he would be if he knew nothing about the subject. Modeling 
such dependencies would be tedious, though, and we hope that this dependency 
averages out among all users. 

As we do not know the true distribution behind the data, we cannot obtain 
a p-value or other comparison based on that. Instead, we use randomization 
statistics. The null hypothesis period is sampled with replacement for samples the 
size of the smoothing function’s window¹, and the smoothed value is calculated 
for each of these samples. The resulting empirical distribution is used as the 
distribution of the smoothed values under the null hypothesis of no epidemic, 
and the p-value of the day in question is taken to be the proportion of the sampled 
sequences having the same or a higher smoothed value. (For more information 
on randomization statistics, see for instance [6].) 

Testing every day like this could cause a bad positive prediction value for 
two reasons. First, if the proportion of negatives to positives in the set of objects 
tested rises, even a good specificity causes bad prediction values eventually. Sec- 
ond, the detection method itself includes randomization, and thus will eventually 
err if run repeatedly. 
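The first reason follows from the standard relation between sensitivity, specificity, the proportion of positives among the tested days, and the positive prediction value. The short Python sketch below illustrates it; the function name and the numbers in the usage note are illustrative assumptions, not results from this work.

```python
def positive_prediction_value(sensitivity, specificity, positive_fraction):
    """Probability of an epidemic given an alarm (Bayes' rule).

    positive_fraction is the share of tested days that are truly epidemic.
    """
    true_alarms = sensitivity * positive_fraction
    false_alarms = (1.0 - specificity) * (1.0 - positive_fraction)
    return true_alarms / (true_alarms + false_alarms)
```

For instance, with a sensitivity of 0.8 and a specificity of 0.99, the value is about 0.90 when one tested day in ten is epidemic, but drops to about 0.45 when only one tested day in a hundred is epidemic.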

To avoid this problem, we require that the smoothed value of a day is both 
high (when compared with other observed values) and statistically significant 
(tested with randomization). The first requirement is fulfilled by checking if 
the smoothed value of that day is higher than a certain percentile of smoothed 
values of the original series in the null hypothesis period. Thus, the tested set of 
days is limited to a subset of all days having a high value of the statistic under 
surveillance. The use of a cut-off value can also be seen as a crude and quick 
estimate of our statistical test. 

¹ Note that here we sample new sequences, that is, individual points until we have w 
points, not windows of size w from the original series. 

To put this together, given a time series, a window length $w > 1$, a cutoff 
value $c \in [50, 100)$, a smoothing function $f$ from a window in the series (a run 
of consecutive points) to a real number, a null hypothesis period, and a p-value 
threshold $p$, the method works as follows: 

(1) for the null hypothesis period, calculate the $w$-day smoothed values 
according to $f$, store these in $S_1$ 

(2) check if today’s smoothed value exceeds the $c$’th percentile of all smoothed 
values in $S_1$, and if it does: 

(3) resample samples of size $w$ from the null hypothesis period of the 
original non-smoothed series, and calculate the smoothed value for each 
of these windows, store these in $S_2$ 

(4) determine the proportion of values in $S_2$ that are the same or higher 
than the value for today, and if that proportion is lower than $p$: 

(5) output an alarm, together with the proportion. 
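The five steps above can be sketched in Python as follows. This is a minimal illustration, not the Matlab implementation used in this work. The names `check_today` and `percentile`, the default of 2000 resamples, the interpolated percentile, and the choice to treat the null hypothesis days as one concatenated series when computing the smoothed values are all assumptions made for the sketch.

```python
import random
from statistics import mean


def percentile(values, c):
    """c'th percentile, linearly interpolated between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * c / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def check_today(series, null_days, w, c, p, f=mean, n_samples=2000, rng=None):
    """Return (alarm, proportion) for the last day of `series`.

    series: proportional daily counts, oldest first.
    null_days: increasing 0-based indices forming the null hypothesis
        period; at least w days are assumed.
    f: smoothing function mapping a window (list) to a number.
    """
    rng = rng or random.Random()
    today = f(series[-w:])
    null = [series[i] for i in null_days]

    # (1) smoothed values over the null hypothesis period
    s1 = [f(null[i - w + 1:i + 1]) for i in range(w - 1, len(null))]

    # (2) cheap pre-check against the c'th percentile
    if today <= percentile(s1, c):
        return False, None

    # (3) resample individual points with replacement, w at a time
    s2 = [f(rng.choices(null, k=w)) for _ in range(n_samples)]

    # (4) proportion of resampled values at least as high as today's
    proportion = sum(v >= today for v in s2) / n_samples

    # (5) alarm if the proportion falls below the threshold p
    return proportion < p, proportion
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing parameter settings.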



The (non-weighted) $w$-day sliding average is calculated simply by replacing 
each data point with an average of $w$ days. This can be done either using time 
points on both sides of the day, or using the previous $w - 1$ days. As in this 
problem future data is not available, we use the latter method: 

$$SA_i = \frac{1}{w} \sum_{j=1}^{w} d_{i-j+1}$$

Sliding linear regression, on the other hand, works by fitting a line with the least 
squares over the $w$ days $[i - w + 1, i]$, and then taking the smoothed value at $i$ to 
be the value of this linear function at that point. The benefit of this smoothing 
method is that it reacts faster to abrupt changes in the series; in a way the 
sliding linear function exaggerates the linear tendencies in the series, while the 
sliding average tries to smooth them out. 

Naturally, the longer the window, the less short-term changes affect the 
smoothed series. However, as the window stretches only backward in time, as 
opposed to backward and forward, this causes a lag in the smoothed curves’ 
reaction to changes. See Figure 1 for an example series and smoothings on it. 
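Both smoothing functions are easy to state in code. The following Python sketch is an illustration under our own naming (`sliding_average`, `sliding_linear_regression`, and `smooth` are hypothetical names); each function maps one backward-looking window to a single smoothed value.

```python
def sliding_average(window):
    """Plain mean of the w most recent values."""
    return sum(window) / len(window)


def sliding_linear_regression(window):
    """Fit a least-squares line to the window and evaluate it at the
    last point of the window. Requires at least two values."""
    w = len(window)
    x_mean = (w - 1) / 2.0
    y_mean = sum(window) / w
    sxx = sum((x - x_mean) ** 2 for x in range(w))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(window))
    slope = sxy / sxx
    return y_mean + slope * ((w - 1) - x_mean)


def smooth(series, w, f):
    """Apply f to every window [i - w + 1, i] that fits in the series."""
    return [f(series[i - w + 1:i + 1]) for i in range(w - 1, len(series))]
```

On the window 0, 0, 0, 0, 1 the sliding average gives 0.2 while the sliding linear regression gives 0.6, which illustrates the faster reaction of the latter to an abrupt rise.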

4 Test Results 

4.1 Results on Artificial Test Data 

Forty artificial test series were constructed to test the performance of the 
algorithm with different parameters. Each series is 700 time points long. 20 are 
constructed from an exponential distribution, another 20 from a normal 
distribution with values below zero replaced by zero. 10 of each type of series had 
no epidemics; in the other 20 datasets timepoints [351,450] were replaced by 
samples drawn from a similar distribution as the main series, but with a higher 
mean. Parameters for the distributions were chosen based on the means and 
variations in the real life test data. See Figure 1 for an example series. 
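A series of this kind could be generated along the following lines. The sketch is illustrative only: the function name and all numeric defaults (baseline mean, epidemic mean, standard deviation) are placeholder assumptions, since the actual parameter values are not listed here.

```python
import random


def artificial_series(kind, epidemic, base_mean=0.01, epidemic_mean=0.03,
                      sd=0.005, length=700, seed=0):
    """Generate one test series of proportional daily counts.

    kind: "exponential" or "normal" (normal values below zero become zero).
    epidemic: if True, timepoints 351..450 (1-based) use the higher mean.
    All numeric defaults are placeholders, not the values used in the paper.
    """
    rng = random.Random(seed)

    def draw(mean_value):
        if kind == "exponential":
            return rng.expovariate(1.0 / mean_value)
        return max(0.0, rng.gauss(mean_value, sd))

    series = []
    for t in range(1, length + 1):
        in_epidemic = epidemic and 351 <= t <= 450
        series.append(draw(epidemic_mean if in_epidemic else base_mean))
    return series
```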

In all the tests on artificial data, the null hypothesis period for timepoint $i$ 
is $[1, i - w]$ for the non-epidemic series and $[1, i - w] \setminus [351, 450]$ for the epidemic 
ones. As we know for certain which days are epidemic and which are not, we can 
calculate sensitivity, specificity and delays exactly. No epidemic was completely 
missed, giving an epidemics-wise sensitivity of exactly 1 for all settings. Below, 
we have explored the sensitivity and specificity day-wise, that is, the algorithm 
is expected to mark each day either belonging to an epidemic or not. We also 
examine the detection delay, defined as the number of false negatives from the 
first day of an epidemic (time point 351) until the first true positive during the 
epidemic. 
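The day-wise measures defined above can be computed from the daily alarm output as in the following Python sketch. The function name and input layout are assumptions for illustration; the sketch also assumes that the epidemic starts after the initial days that are ignored in the statistics.

```python
def daywise_performance(alarms, epidemic, skip=100):
    """Day-wise sensitivity, specificity and detection delay.

    alarms: list of booleans, alarms[t - 1] is True if day t was flagged.
    epidemic: (first_day, last_day), 1-based inclusive, or None.
    skip: number of initial days left out of the statistics.
    Returns (sensitivity, specificity, delay); a value is None when it
    is undefined (e.g. sensitivity of a series without an epidemic).
    """
    tp = fp = tn = fn = 0
    delay = None
    for t, alarm in enumerate(alarms, start=1):
        if t <= skip:
            continue
        is_epidemic = epidemic is not None and epidemic[0] <= t <= epidemic[1]
        if is_epidemic and alarm:
            tp += 1
            if delay is None:
                # false negatives between epidemic onset and first hit
                delay = t - epidemic[0]
        elif is_epidemic:
            fn += 1
        elif alarm:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else None
    specificity = tn / (tn + fp) if tn + fp else None
    return sensitivity, specificity, delay
```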

In practice, 20,000 samples were enough to produce steady results on the 
randomization tests; 80,000 were used for the sliding average and 20,000 for the 
sliding linear, due to the first one’s Matlab implementation being so much faster 
that the extra certainty was worthwhile. Unless mentioned otherwise, the p-value 
threshold is 0.01. When calculating the performance statistics, the first one hundred days 
are ignored. 

Figure 2.a shows the day-wise specificity of the algorithm for all the test 
series with different parameter values. Note how specificity drops in the sliding 
average tests. This is mostly due to the fact that the longer the window, the 
longer it takes after the epidemic period before the smoothed values return to 
the baseline. As a faster-reacting function, sliding linear regression does not suffer 
from similar problems, but the distribution of the specificities widens when the 
window grows: more series are detected with 100 % specificity, but some epidemic 
periods are also less well detected than they would be with a shorter window. 




Fig. 1. A portion of an artificial time series (gray bars) with an "epidemic" in the 
middle of it (bordered by the vertical black lines). On the left, a 14-day sliding average 
(thin line) and a 30-day sliding average (thick line). On the right, a 30-day sliding 
average (again the thick line) and a 30-day sliding linear regression (thin line). 

Fig. 2. Performance of the algorithm for cutoff = 99, p = 0.01. In each 
subfigure, from the left, sliding average with windows 14 and 30, and sliding 
linear regression with 14, 30 and 40. Circles and squares represent individual 
test series. Stars connected by a line show the average performance figure. 
Shown are a) specificity, b) sensitivity, c) delay. 



As one would expect, specificity is dependent on the parameters cutoff and 
p. If window length is kept constant and cutoff and p varied, keeping cutoff = 
100(1 — p), the specificity is linear on this double parameter (not shown in the 
figure). 

Figure 2.b shows the respective day-wise sensitivity for all the series. Here we 
see the phenomenon that lowering the window length below a certain limit causes 
a notable decline in sensitivity. This happens because when the window grows, 
the distributions of the smoothed values during epidemic and non-epidemic be- 
come narrower and overlap less, making them easier to distinguish. Figure 2.c 
gives the delay in detecting the epidemic, again for the same parameters. As 
expected, the average delay rises with the window length. Apart from the one 
outlier, all delays are below 20 days and the vast majority of them below 10 
days. 

One could argue that shortening the window length to shorten the delay is 
worthwhile: as long as the epidemic is still detected, the lower sensitivity does not 
matter. To some extent this is true. However, one important way to tell a real 
epidemic from a false positive is whether the situation persists. False positives 
appear singly or in short clusters, whereas real epidemics last for several days or weeks. 
While we can afford to lose some sensitivity in order to gain a shorter delay, we 
cannot let it go altogether. 

Experimenting on the effect of window length, keeping other parameters con- 
stant, revealed that when the window length shortens, sensitivity stays reason- 
ably good (that is, around 80-90 %) up to a point, and then drops steeply. This 
drop happens around window length 15 for the sliding average and around win- 
dow length 40 for the sliding linear regression (at least on the data used in this 
work). Comparing the specificity and sensitivity of the smoothing methods on 
these lengths, it can be seen that while there are some differences on certain 
series between the methods, their overall performance is close to equal, and that 
there are no systematic differences depending on the type of the series. (Test 
results not shown.) 

4.2 Results on Real Life Data 

Ten diseases were selected as test targets, namely hepatitis A, influenza, diph- 
theria, legionellosis, Pogosta disease, polio, parotitis, tularemia, varicella and 
measles. The keyword for each is the (Finnish) name of the disease, and each 
test series is 729 days long. Parameters used are cutoff = 99, p = 0.01, and 
window lengths 14 (sliding average) and 40 (sliding linear regression). 

Unlike with artificial data, we do not now have conclusive knowledge of which 
days are epidemic and which are not; the only definite example of an epidemic 
interesting from the public health point of view is the Pogosta epidemic of 2002, 
which began in August (Figure 3.a). So we count negatives after August 1st 
2002 (day 668) in that series as false negatives, and exclude timepoints from day 
680 onward from the null hypothesis period. In other series the null hypothesis 
period is $[1, i - w]$. 

Delay of Detection. The Pogosta epidemic was detected on day 665 by the sliding 
average, and on day 666 by the sliding linear regression method (see Figure 3.a). 
Compared to the shape of the actual epidemic (the thick line at the bottom), we 
can see that this is remarkably early. The epidemic curve is drawn based on the 
day the diagnostic blood sample was taken, which is likely to be the same day as, 
or the day after, the patient’s physician first suspected the disease. Even if we had 
data straight from the physician’s office, we could not have seen the epidemic 
much earlier. (Data for the epidemic curve is based on cases later notified to the 
Infectious Diseases Registry of the National Public Health Institute.) 

On the sliding average there was a short alarm peak also at time-points 636- 
639. It is unclear whether this is a true first alarm of the epidemic, or a sequence 
of false positives; in the following it is counted as the latter. No days were falsely 
classified as negative during the epidemic. 

Sensitivity, Specificity and Positive Prediction Value. Over all the series, in the 
sliding average, if we consider only the Pogosta epidemic of 2002 as true positives, 
specificity was 99.02 %, and the positive prediction value 51.2 %. The sliding 
linear regression had specificity 99.00 % and positive prediction value of 50.5 %. 

Fig. 3. The output of the algorithm in some situations. a. The Pogosta series with 
the output of both smoothing methods. Thin line, sliding average; thick line in the 
middle, sliding linear regression. The null hypothesis period is the whole preceding 
period minus timepoints 680 and onward. The bottom-most thick line shows the shape 
of the actual epidemic: weekly incidence according to the day the first diagnostic blood 
sample was taken, which probably is the same day or close to the day that the patient’s 
physician first suspected the disease. b. The tularemia series with the output of the 
sliding average smoothing method. Thin line, the null hypothesis period is the whole 
preceding series; thick line, it is the preceding 180 days. 



With the two methods, continuous runs of "false positives" happened in 
four series (legionellosis, parotitis (mumps), Pogosta disease, and tularemia) in about 
the same places. Looking at the series, and bearing in mind that the detection is based 
only on time previous to those time-points, these four periods of alarms seem 
reasonable and even desirable. See Figure 3 for two of the cases. 

If we count the positives during these periods as true positives, we get a 
positive predictive value of 85.3 % for the sliding average and 90.1 % for the 
sliding linear regression. (Stating specificity would require arbitrarily determin- 
ing which days, if any, around these positives are also positives.) In the real 
situation, where the series under surveillance and the null hypothesis periods 
are chosen by an epidemiologist, the positive prediction value (defined through 
the usefulness of the alarm) will probably be somewhere between these estimates 
of 50 and 90 %. 



4.3 Changing the Null Hypothesis Period 

Figure 3.b demonstrates the effect of the null hypothesis period. The thin line 
shows the output when the null hypothesis period is the whole preceding series; 
the thick line when it is the last 180 days. The yearly tularemia epidemic during 
the second year does not cause an alarm when the epidemic during the previous 
year is included in the null hypothesis, and does cause an alarm when it is not. 

Another interesting feature is also visible in the lower output. Looking at 
the second predicted epidemic (from timepoint 600 onward), we can see that 
there first is an alarm period, but when the alarm has been on for some time, 
it ceases. When the epidemic rises again, new alarms are put out. This happens 
because the epidemic time is now not excluded from the null hypothesis. Thus 
"epidemics inside epidemics" can be detected. 

5 Technical Comments 

The time requirement of the method for the check on one day is linear in the 
number of iterations $i$ performed in the randomization and the window length 
$w$, as taking a random sample and calculating a mean or fitting a line are all 
linear. Thus, running the check for $m$ diseases (which is what would normally be 
done daily) with $i$ iterations is $O(wmi)$, and calculating the results for $n$ days is 
$O(nwmi)$. Since typically $w \ll i$, these are close to $O(mi)$ and $O(nmi)$. 

Currently, data arrives from the system administrator of the actual database 
daily, as a text file. The data is read into an (Oracle RDB) database, and the 
reading events per keyword are calculated and stored. The raw article counts 
are also stored, to make it simpler to add a new disease keyword to the base 
(without the need to reread the data files). When the amount of data grows 
too large, article counts might be preserved for perhaps two years, and only 
keyword-specific values stored for a longer period, enabling the beginning of a 
new surveillance with some data, but not requiring too much space. 

The test versions of the algorithms as explained in this work were implemented 
in Matlab. Conversion to a Java program for end users is planned. 



6 Related Work 

Basically, most methods of online detection of changes in a time series fall into 
two categories. In the first, we compare each value, at the time of its arrival, 
to some baseline value calculated from previous data, and decide if the new 
value is "different enough" to be considered abnormal [7,8]. In the second, we fit 
some model, often a curve - constant [9], linear [10,11], or more complex [12] - 
piecewise to the data, searching for the change-points. 

Other than these two approaches, data mining of time series data has been 
studied mainly from the point of view of mining inter-series relations in either 
patterns (slopes, peaks) or in concurrently occurring different events (see, for 
example, [13,14]), which approaches are not directly related to the problem of 
this study. 

Piatetsky-Shapiro and Matheus were among the first data miners to investigate 
deviations in time series data [7]. Their basic concept is a deviation, defined 
as a difference between an observed value and a normative value. In addition, 
they categorize the deviations based on their interestingness, defined as the 
utility of the finding to a user. 

Stern and Lightfoot describe a system for detecting clusters of human infection 
with enteric pathogens [8]. In it, a smoothing technique is used to determine 
normal baseline incidences for each pathogen, area and time-of-year, based on 
data from several previous years. Then weekly counts of cases are compared to 
a threshold calculated from the base incidence. The system achieves great 
sensitivity, over 90 %, but the positive predictive value remains at about 60 % or 
lower - meaning that almost every other alarm is false. 

The problem with these approaches is that calculating the normative values 
requires data from several years, and that they give accurate results only if 
the difference between normal and abnormal values is rather sharp. Another 
problem is that the calculation of normative values requires some knowledge or 
assumption about the shape of the distribution of the values; in our case such 
knowledge is not readily available. 

Ogden and Sugiura [10] describe test statistics for determining whether a 
time series has undergone a change. The change is defined as a linear change in 
the parameters of the underlying distribution: the parameter vector is $\theta$ from 
the beginning of the series to some timepoint $t_i$, then changes in a linear way 
until it reaches $\theta + \delta$ at $t_j$, $j > i$. The null hypothesis tested is $\delta = 0$. However, 
the tests cannot be applied online to decide if there has been a change recently, 
and require information on the distribution of the data. 

Keogh et al. [11] explore segmenting a time series into a piecewise linear
representation, relative to an error criterion stating whether a line fit is "good
enough" (for example, the total error must not exceed a certain value). They
describe three basic greedy approaches, only one of them online, and a combina- 
tion online algorithm that performed well. Guralnik and Srivastava [12] suggest 
not restricting the function to be fitted to the segments to lines (for instance, 
one could allow the algorithm to choose the best fit of 0-3 degree polynomials, 
instead). In his thesis [9] Marko Salmenkivi introduced methods for intensity 
modeling; that is, assuming a sequence of independent events in time, finding 
a piece-wise constant function describing the intensity of the Poisson-process 
producing that sequence. 

None of the above methods is directly suitable for online detection of change
points. We experimented with the online algorithm of Keogh et al., trying to use
it to detect epidemics. The idea was to segment the series, and then look at
the slope of the last segments. Unfortunately, we were unable to calibrate an error
criterion that would be both specific enough and produce a short enough delay,
and unable to adapt the method to answer several surveillance questions (for
instance, comparing this month’s situation to the same months of the two previous
years proved impractical). Similar problems apply to the other change-point
detection/segmentation approaches.

7 Conclusions 

A method was developed to automatically detect epidemics from an online time- 
series. Despite its simplicity, the method works reliably. In all the test data, all 
epidemics were correctly detected. Even when calculated day-wise instead of
epidemic-wise, we achieved specificity over 99 % and sensitivity over 80 %. Also
the results on real-life data were very encouraging. 

A nice feature of the method is the adaptability that is achieved by changing 
the null hypothesis period. The same method can readily answer several kinds of 
questions of interest such as detecting acute short-term changes and comparing 
epidemics. 

However, caution must be used before widely applying this - or any - method 
of online surveillance. It must be kept in mind that all surveillance requires the 
capacity to deal with both true and false alarms; surveillance is useless unless 
personnel exist to work on the alarms. A separate prospective study will be 
necessary to establish the actual benefits of surveillance. 

Abstract. In this paper, we present an indiscernibility-based clustering
method that can handle relative proximity. The main benefit of this 
method is that it can be applied to proximity measures that do not satisfy 
the triangular inequality. Additionally, it may be used with a proximity 
matrix - thus it does not require direct access to the original data values. 
In the experiments we demonstrate, with the use of partially mutated 
proximity matrices, that this method produces good clusters even when 
the employed proximity does not satisfy the triangular inequality. 



1 Introduction 

Clustering is a powerful tool for revealing underlying structure of the data. A 
number of methods, for example, hierarchical, partial, and model-based methods, 
have been proposed and have produced good results on both artificial and real- 
life data [1]. 

In order to assess the quality of clusters being produced, most of the con- 
ventional clustering methods employ quality measures that are associated with 
centroids of clusters. For example, the internal homogeneity of a cluster can be 
measured as the sum of differences from objects in the cluster to their centroid, 
and it can be further used as a component of the total quality measure for assessing
a clustering result. Such centroid-based methods work well on datasets in
which the proximity of objects satisfies the properties of a distance, namely positivity
($d(x,y) \ge 0$), identity ($d(x,y) = 0$ iff $x = y$), symmetry ($d(x,y) = d(y,x)$),
and the triangular inequality ($d(x,z) \le d(x,y) + d(y,z)$), for any objects $x$, $y$ and $z$.
However, they have a potential weakness in handling relative proximity. Relative
proximity is a class of proximity measures that is suitable for representing subjective
similarity or dissimilarity such as the degree of likeness between people. It
may not satisfy the triangular inequality because the proximity $d(x,z)$ of $x$ and
$z$ is allowed to be independent of $y$. Usually, the centroid $c$ of objects $x$, $y$ and $z$
is expected to be in their convex hull. However, if we use relative proximity, the
centroid can be out of the convex hull of $x$, $y$ and $z$ because proximity between $c$ and
other objects can be far greater (if we use dissimilarity as proximity) or smaller
(if we use similarity) than $d(x,y)$, $d(y,z)$ and $d(x,z)$. Namely, a centroid does
not hold its geometric properties under these conditions. Thus another criterion
should be used for evaluating the quality of the clusters.

In this paper, we present a new clustering method based on the indiscernibil- 
ity degree of objects. The main benefit of this method is that it can be applied 
to proximity measures that do not satisfy the triangular inequality. Additionally, 
it may be used with a proximity matrix - thus it does not require direct access 
to the original data values. 

2 The Method 

2.1 Overview 

Our method is based on iterative refinement of N binary classifications, where N 
denotes the number of objects. First, an equivalence relation, that classifies all 
the other objects into two classes, is assigned to each of N objects by referring 
to the relative proximity. Next, for each pair of objects, the number of binary 
classifications in which the pair is included in the same class is counted. This 
number is termed the indiscernibility degree. If the indiscernibility degree of a 
pair is larger than a user-defined threshold value, the equivalence relations may 
be modified so that all of the equivalence relations commonly classify the pair 
into the same class. This process is repeated until class assignment becomes 
stable. Consequently, we may obtain the clustering result that follows a given 
level of granularity, without using geometric measures. 

2.2 Assignment of Initial Equivalence Relations 

When dissimilarity is defined relatively, the only information available for object
$x_i$ is the dissimilarity of $x_i$ to other objects, for example to $x_j$, $d(x_i, x_j)$. This is
because the dissimilarities for other pairs of objects, namely $d(x_j, x_k)$, $x_j, x_k \ne x_i$,
are determined independently of $x_i$. Therefore, we independently assign an
initial equivalence relation to each object and evaluate the relative dissimilarity
observed from the corresponding object.

Let $U = \{x_1, x_2, \ldots, x_N\}$ be the set of objects we are interested in. An equivalence
relation $R_i$ for object $x_i$ is defined by

$$U/R_i = \{P_i,\ U - P_i\}, \tag{1}$$

where

$$P_i = \{x_j \mid d(x_i, x_j) \le Th_{d_i}\}, \quad \forall x_j \in U. \tag{2}$$

$d(x_i, x_j)$ denotes dissimilarity between objects $x_i$ and $x_j$, and $Th_{d_i}$ denotes an
upper threshold value of dissimilarity for object $x_i$. The equivalence relation
$R_i$ classifies $U$ into two categories: $P_i$, which contains objects similar to $x_i$, and
$U - P_i$, which contains objects dissimilar to $x_i$. When $d(x_i, x_j)$ is smaller than
$Th_{d_i}$, object $x_j$ is considered to be indiscernible to $x_i$. $U/R_i$ can be alternatively
written as $U/R_i = \{[x_i]_{R_i}, \overline{[x_i]_{R_i}}\}$, where $[x_i]_{R_i} \cap \overline{[x_i]_{R_i}} = \emptyset$ and
$[x_i]_{R_i} \cup \overline{[x_i]_{R_i}} = U$ hold.
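
As a minimal sketch of Equations (1) and (2), assume the dissimilarities are held in a list of rows `D` and the per-object thresholds in a list `th`. Both names, the toy values, and the use of "less than or equal" at the threshold are assumptions made for illustration only. The class $P_i$ can then be read off row by row:

```python
def initial_classes(D, th):
    """Return P_i = {j : D[i][j] <= th[i]} for every object i (cf. Eq. 2)."""
    return [{j for j, dij in enumerate(row) if dij <= th[i]}
            for i, row in enumerate(D)]

# Hypothetical 4-object example; D need not be symmetric or metric.
D = [[0.0, 0.2, 0.9, 0.8],
     [0.2, 0.0, 0.7, 0.9],
     [0.9, 0.7, 0.0, 0.1],
     [0.8, 0.9, 0.1, 0.0]]
th = [0.2, 0.2, 0.1, 0.1]
P = initial_classes(D, th)  # P[i] is the first class of U/R_i
```

Each `P[i]` is the first class of $U/R_i$; the second class is simply its complement in $U$, so it does not need to be stored.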

Fig. 1. An example of function $f(d)$ generated by $d(x_i, x_s)$.

Fig. 2. Relations between $f(d)$ and its smoothed first- and second-order derivatives $F'(d)$ and $F''(d)$.

Definition of the dissimilarity measure $d(x_i, x_j)$ is arbitrary. If all the attribute
values are numerical, ordered, and independent of each other, conventional
Minkowski distance

$$d(x_i, x_j) = \left( \sum_{a=1}^{N_a} |x_{ia} - x_{ja}|^p \right)^{1/p}, \tag{3}$$

where $N_a$ denotes the number of attributes, $x_{ia}$ denotes the $a$-th attribute of
object $x_i$, and $p$ denotes a positive integer, is a reasonable choice since it has
been successfully applied to many areas and its mathematical properties have
been well investigated. More generally, any type of dissimilarity measure can be
used regardless of whether or not the triangular inequality is satisfied among
objects.
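
For reference, Equation (3) can be written as a short function. This is only an illustrative sketch; the function name `minkowski` and the tuple representation of objects are assumptions, not part of the method.

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two attribute vectors (cf. Eq. 3)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

d_euclid = minkowski((0, 0), (3, 4), p=2)     # p = 2: Euclidean distance
d_manhattan = minkowski((0, 0), (3, 4), p=1)  # p = 1: city-block distance
```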

Threshold of dissimilarity $Th_{d_i}$ for object $x_i$ is automatically determined
based on the spatial density of objects. The procedure is summarized as follows.

1. Sort $d(x_i, x_j)$ in ascending order. For simplicity, we denote the sorted dissimilarity
using the same representation $d(x_i, x_s)$, $1 \le s \le N$.

2. Generate a function $f(d)$ that represents the cumulative distribution of $d$.
For a given dissimilarity $d$, function $f$ returns the number of objects whose
dissimilarity to $x_i$ is smaller than $d$. Figure 1 shows an example. Function
$f(d)$ can be generated by linearly interpolating $f(d(x_i, x_s)) = n$, where $n$
corresponds to the index of $x_s$ in the sorted dissimilarity list.

3. Obtain the smoothed second-order derivative of $f(d)$ as a convolution of $f(d)$
and the second-order derivative of Gaussian function as follows.

$$F''(d) = \int_{-\infty}^{\infty} \frac{(d-u)^2 - \sigma^2}{\sigma^5 \sqrt{2\pi}}\, e^{-(d-u)^2 / 2\sigma^2} f(u)\, du, \tag{4}$$

where $f(d) = 1$ and $f(d) = N$ are used for $d < 0$ and $d > 1$ respectively.
The smoothed first-order derivative $F'(d)$ of $f(d)$ represents spatial density
of objects because it represents increase or decrease velocity of the objects
induced by the change of dissimilarity. Therefore, by calculating its further
derivative as $F''(d)$, we find a sparse region between two dense regions. Figure
2 illustrates relationship between $f(d)$ and its smoothed derivatives. The
most sparse point $d^*$ should take a local minimum of the density where the
following conditions are satisfied.

$$F''(d^* - \Delta d) < 0 \quad \text{and} \quad F''(d^* + \Delta d) > 0. \tag{5}$$

Usually, there are some $d^*$s in $f(d)$ because $f(d)$ has multiple local minima.
The value of $\sigma$ in the above Gaussian function can be adjusted to eliminate
meaningless small minima.

4. Choose the smallest $d^*$ and object $x_{j^*}$ whose dissimilarity is the closest to
but not larger than $d^*$. Finally, the dissimilarity threshold $Th_{d_i}$ is obtained
as $Th_{d_i} = d(x_i, x_{j^*})$.
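
The four steps above can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the function name `dissimilarity_threshold`, the uniform evaluation grid, the truncation of the Gaussian kernel at six standard deviations, the treatment of ties at the smallest dissimilarity, and the fallback to the largest dissimilarity when no sign change is found are all choices made here for the example.

```python
import math
from bisect import bisect_right

def dissimilarity_threshold(dists, sigma, grid_size=400):
    """Return Th_di from the dissimilarities of one object to all objects."""
    d = sorted(dists)                                   # step 1
    lo, hi, n = d[0], d[-1], len(d)
    if hi == lo:
        return hi
    h = (hi - lo) / (grid_size - 1)                     # grid spacing (assumption)

    def f(x):                                           # step 2: interpolated count
        if x <= lo:
            return float(bisect_right(d, lo))           # constant left of the range
        if x >= hi:
            return float(n)                             # f = N right of the range
        s = bisect_right(d, x)                          # d[s-1] <= x < d[s]
        return s + (x - d[s - 1]) / (d[s] - d[s - 1])

    def g2(u):                                          # 2nd derivative of a Gaussian
        return ((u * u - sigma ** 2) / (sigma ** 5 * math.sqrt(2 * math.pi))
                * math.exp(-u * u / (2 * sigma ** 2)))

    half = math.ceil(6 * sigma / h)                     # kernel cut at 6 sigma
    offsets = range(-half, half + 1)
    kernel = [g2(m * h) for m in offsets]
    grid = [lo + k * h for k in range(grid_size)]
    # step 3: F''(d) as a discrete convolution of f with the kernel
    F2 = [h * sum(w * f(x - m * h) for w, m in zip(kernel, offsets)) for x in grid]
    for k in range(1, grid_size):
        if F2[k - 1] < 0 < F2[k]:                       # first sign change, cf. Eq. (5)
            return d[bisect_right(d, grid[k]) - 1]      # step 4: largest d <= d*
    return hi                                           # no sparse region found

# Hypothetical dissimilarities: four near objects, then a gap, then three far ones.
th = dissimilarity_threshold([0.0, 0.1, 0.12, 0.15, 0.8, 0.85, 0.9], sigma=0.05)
```

In this toy input the only sparse region lies between 0.15 and 0.8, so the threshold settles on 0.15, the last dissimilarity before the gap. A larger `sigma` smooths away small gaps, which corresponds to the remark in step 3 about eliminating meaningless small minima.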

2.3 Refinement of Initial Equivalence Relations 

Suppose we are interested in two objects, $x_i$ and $x_j$. In indiscernibility-based
classification, they are classified into different categories regardless of other re- 
lations, if there is at least one equivalence relation that has an ability to discern 
them. In other words, the two objects are classified into the same category only 
when all of the equivalence relations commonly regard them as indiscernible ob- 
jects. This strict property is not acceptable in clustering because it will generate 
many meaningless small categories, especially when global associations between 
the equivalence relations are not taken into account. We consider that objects 
should be classified into the same category when most of, but not necessarily 
all of, the equivalence relations commonly regard the objects as indiscernible. 
In the second stage, we perform global optimization of initial equivalence rela- 
tions so that they produce adequately coarse classification to the objects. The 
global similarity of objects is represented by a newly introduced measure, the 
indiscernibility degree. Our method takes a threshold value of the indiscernibil- 
ity degree as an input and associates it with the user-defined granularity of the 
categories. Given the threshold value, we iteratively refine the initial equivalence 
relations in order to produce categories that meet the given level of granularity. 

Now let us assume $U = \{x_1, x_2, x_3, x_4, x_5\}$ and classifications of $U$ by $\mathbf{R} =
\{R_1, R_2, R_3, R_4, R_5\}$ are given as follows.

$$\begin{aligned}
U/R_1 &= \{\{x_1, x_2, x_3\}, \{x_4, x_5\}\}, \\
U/R_2 &= \{\{x_1, x_2, x_3\}, \{x_4, x_5\}\}, \\
U/R_3 &= \{\{x_2, x_3, x_4\}, \{x_1, x_5\}\}, \\
U/R_4 &= \{\{x_1, x_2, x_3, x_4\}, \{x_5\}\}, \\
U/R_5 &= \{\{x_4, x_5\}, \{x_1, x_2, x_3\}\}.
\end{aligned} \tag{6}$$

This example contains three types of equivalence relations: $R_1$ ($= R_2 = R_5$), $R_3$
and $R_4$. Since each of them classifies $U$ slightly differently, classification of $U$
by the family of equivalence relations $\mathbf{R}$, $U/\mathbf{R}$, contains four very small, almost
independent categories.

$$U/\mathbf{R} = \{\{x_1\}, \{x_2, x_3\}, \{x_4\}, \{x_5\}\}. \tag{7}$$



In the following we present a method to reduce the variety of equivalence rela- 
tions and to obtain coarser categories. 

First, we define an indiscernibility degree, $\gamma(x_i, x_j)$, for two objects $x_i$ and
$x_j$ as follows.

$$\gamma(x_i, x_j) = \frac{\sum_{k=1}^{|U|} \delta_k^{indis}(x_i, x_j)}{\sum_{k=1}^{|U|} \delta_k^{indis}(x_i, x_j) + \sum_{k=1}^{|U|} \delta_k^{dis}(x_i, x_j)}, \tag{8}$$

where

$$\delta_k^{indis}(x_i, x_j) = \begin{cases} 1, & \text{if } (x_i \in [x_k]_{R_k} \wedge x_j \in [x_k]_{R_k}) \\ 0, & \text{otherwise,} \end{cases} \tag{9}$$

and

$$\delta_k^{dis}(x_i, x_j) = \begin{cases} 1, & \text{if } (x_i \in [x_k]_{R_k} \wedge x_j \notin [x_k]_{R_k}) \text{ or} \\ & \text{if } (x_i \notin [x_k]_{R_k} \wedge x_j \in [x_k]_{R_k}) \\ 0, & \text{otherwise.} \end{cases} \tag{10}$$

Equation (9) shows that $\delta_k^{indis}(x_i, x_j)$ takes 1 only when the equivalence relation
$R_k$ regards both $x_i$ and $x_j$ as indiscernible objects, under the condition that
both of them are in the same equivalence class as $x_k$. Equation (10) shows
that $\delta_k^{dis}(x_i, x_j)$ takes 1 only when $R_k$ regards $x_i$ and $x_j$ as discernible objects,
under the condition that either of them is in the same class as $x_k$. By summing
$\delta_k^{indis}(x_i, x_j)$ and $\delta_k^{dis}(x_i, x_j)$ for all $k$ ($1 \le k \le |U|$) as in Equation (8), we obtain
the percentage of equivalence relations that regard $x_i$ and $x_j$ as indiscernible
objects. Note that in Equation (9), we excluded the case when $x_i$ and $x_j$ are
indiscernible but not in the same class as $x_k$. This is to exclude the case where
$R_k$ does not significantly put weight on discerning $x_i$ and $x_j$. As mentioned
in Section 2.2, $P_k$ for $R_k$ is determined by focusing on similar objects rather
than dissimilar objects. This means that when both of $x_i$ and $x_j$ are highly
dissimilar to $x_k$, their dissimilarity is not significant for $x_k$, when determining
the dissimilarity threshold $Th_{d_k}$. Thus we only count the number of equivalence
relations that certainly evaluate the dissimilarity of $x_i$ and $x_j$.

For example, the indiscernibility degree $\gamma(x_1, x_2)$ of objects $x_1$ and $x_2$ in the
above case is calculated as follows.

$$\gamma(x_1, x_2) = \frac{\sum_{k=1}^{5} \delta_k^{indis}(x_1, x_2)}{\sum_{k=1}^{5} \delta_k^{indis}(x_1, x_2) + \sum_{k=1}^{5} \delta_k^{dis}(x_1, x_2)} = \frac{1+1+0+1+0}{(1+1+0+1+0) + (0+0+1+0+0)} = \frac{3}{4}. \tag{11}$$

Let us explain this example with the calculation of the numerator $(1+1+0+1+0)$.
The first value 1 is for $\delta_1^{indis}(x_1, x_2)$. Since $x_1$ and $x_2$ are in the same
class of $R_1$ and obviously, they are in the same class to $x_1$, $\delta_1^{indis}(x_1, x_2) = 1$
holds. The second value is for $\delta_2^{indis}(x_1, x_2)$, and analogously, it becomes 1. The
third value is for $\delta_3^{indis}(x_1, x_2)$. Since $x_1$ and $x_2$ are in the different classes of
$R_3$, it becomes 0. The fourth value is for $\delta_4^{indis}(x_1, x_2)$ and it obviously becomes 1.
The last value is for $\delta_5^{indis}(x_1, x_2)$. Although $x_1$ and $x_2$ are in the same class of
$R_5$, their class is different to that of $x_5$. Thus $\delta_5^{indis}(x_1, x_2)$ returns 0.

Indiscernibility degrees for all of the other pairs in $U$ are tabulated in Table
1. Note that the indiscernibility degree of object $x_i$ to itself, $\gamma(x_i, x_i)$, will always
be 1.

Table 1. Degree $\gamma$ for objects in Eq. (6).

       x1    x2    x3    x4    x5
  x1   3/3   3/4   3/4   1/5   0/4
  x2         4/4   4/4   2/5   0/5
  x3               4/4   2/5   0/5
  x4                     3/3   1/3
  x5                           1/1

Table 2. Degree $\gamma$ after the first refinement.

       x1    x2    x3    x4    x5
  x1   3/3   3/4   3/4   2/4   1/5
  x2         4/4   4/4   3/4   0/5
  x3               4/4   3/4   0/5
  x4                     3/3   1/5
  x5                           1/1

Table 3. Degree $\gamma$ after the second refinement.

       x1    x2    x3    x4    x5
  x1   4/4   4/4   4/4   4/4   0/5
  x2         4/4   4/4   4/4   0/5
  x3               4/4   4/4   0/5
  x4                     4/4   0/5
  x5                           1/1
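
The degree of Equation (8) can be checked mechanically against Table 1. The sketch below is illustrative only: it assumes each relation $R_k$ is represented by the set `P[k]` of object ids forming the class $[x_k]_{R_k}$ (the first class listed in Equation (6)), and the function name `gamma` is a label chosen here. Exact fractions are used so that values can be compared with thresholds without rounding issues.

```python
from fractions import Fraction

def gamma(P, i, j):
    """Indiscernibility degree of objects i and j (cf. Eq. 8).

    P[k] is the class [x_k]_{R_k} of relation R_k, given as a set of object ids.
    """
    indis = sum(1 for Pk in P if i in Pk and j in Pk)       # Eq. (9)
    dis = sum(1 for Pk in P if (i in Pk) != (j in Pk))      # Eq. (10)
    return Fraction(indis, indis + dis) if indis + dis else Fraction(0)

# Classes [x_k]_{R_k} for k = 1..5 taken from Eq. (6); objects are numbered 1..5.
P = [{1, 2, 3}, {1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}, {4, 5}]
print(gamma(P, 1, 2))  # 3/4, as in Eq. (11)
```

Relations whose class contains neither object contribute to neither sum, which is exactly the exclusion discussed after Equation (10).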

From its definition, a larger $\gamma(x_i, x_j)$ represents that $x_i$ and $x_j$ are commonly
regarded as indiscernible objects by the large number of the equivalence relations.
Therefore, if an equivalence relation $R_l$ discerns the objects that have high $\gamma$
value, we consider that it represents excessively fine classification knowledge
and refine it according to the following procedure (note that $R_k$ is rewritten as
$R_i$ below for the purpose of generalization).

Let $R_i \in \mathbf{R}$ be an initial equivalence relation on $U$. A refined equivalence
relation $R'_i \in \mathbf{R}'$ of $R_i$ is defined as

$$U/R'_i = \{P'_i,\ U - P'_i\}, \tag{12}$$

where $P'_i$ denotes a set of objects represented by

$$P'_i = \{x_j \mid \gamma(x_i, x_j) \ge T_h\}, \quad \forall x_j \in U, \tag{13}$$

and $T_h$ denotes the lower threshold value of the indiscernibility degree above, in
which $x_i$ and $x_j$ are regarded as indiscernible objects. It represents that when
$\gamma(x_i, x_j)$ is larger than $T_h$, $R_i$ is modified to include $x_j$ into the class of $x_i$.

Suppose we are given $T_h = 3/5$ for the case in Equation (6). For $R_1$ we
obtain the refined relation $R'_1$ as

$$U/R'_1 = \{\{x_1, x_2, x_3\}, \{x_4, x_5\}\}, \tag{14}$$

because, according to Table 1, $\gamma(x_1, x_1) = 1 > T_h = 3/5$, $\gamma(x_1, x_2) = 3/4 > 3/5$,
$\gamma(x_1, x_3) = 3/4 > 3/5$, $\gamma(x_1, x_4) = 1/5 < 3/5$, $\gamma(x_1, x_5) = 0/4 < 3/5$ hold.
In the same way, the rest of the refined equivalence relations are obtained as
follows.

$$\begin{aligned}
U/R'_2 &= \{\{x_1, x_2, x_3\}, \{x_4, x_5\}\}, \\
U/R'_3 &= \{\{x_1, x_2, x_3\}, \{x_4, x_5\}\}, \\
U/R'_4 &= \{\{x_4\}, \{x_1, x_2, x_3, x_5\}\}, \\
U/R'_5 &= \{\{x_5\}, \{x_1, x_2, x_3, x_4\}\}.
\end{aligned} \tag{15}$$

Then we obtain classification of $U$ by the refined family of equivalence relations
$\mathbf{R}'$ as follows.

$$U/\mathbf{R}' = \{\{x_1, x_2, x_3\}, \{x_4\}, \{x_5\}\}. \tag{16}$$

In the above example, $R_3$, $R_4$ and $R_5$ are modified so that they include similar
objects into the equivalence class of $x_3$, $x_4$ and $x_5$, respectively. Three types of
the equivalence relations remain; however, the categories become coarser than
those in Equation (7) by the refinement.



2.4 Iterative Refinement of Equivalence Relations 

It should be noted that the state of the indiscernibility degrees could also be
changed after refinement of the equivalence relations, since the degrees are recalculated
using the refined family of equivalence relations $\mathbf{R}'$.

Suppose we are given another threshold value $T_h = 2/5$ for the case in Equation (6).
According to Table 1, we obtain $\mathbf{R}'$ after the first refinement, as follows.

$$\begin{aligned}
U/R'_1 &= \{\{x_1, x_2, x_3\}, \{x_4, x_5\}\}, \\
U/R'_2 &= \{\{x_1, x_2, x_3, x_4\}, \{x_5\}\}, \\
U/R'_3 &= \{\{x_1, x_2, x_3, x_4\}, \{x_5\}\}, \\
U/R'_4 &= \{\{x_2, x_3, x_4\}, \{x_1, x_5\}\}, \\
U/R'_5 &= \{\{x_5\}, \{x_1, x_2, x_3, x_4\}\}.
\end{aligned} \tag{17}$$

Hence

$$U/\mathbf{R}' = \{\{x_1\}, \{x_2, x_3\}, \{x_4\}, \{x_5\}\}. \tag{18}$$

The categories in $U/\mathbf{R}'$ are exactly the same as those in Equation (7). However,
the state of the indiscernibility degrees is not the same because the equivalence
relations in $\mathbf{R}'$ are different from those in $\mathbf{R}$. Table 2 summarizes the indiscernibility
degrees, recalculated using $\mathbf{R}'$. In Table 2, it can be observed that the
indiscernibility degrees of some pairs of objects, for example $\gamma(x_1, x_4)$, increased
after the refinement, and now they exceed the threshold $T_h = 2/5$. Thus we
perform refinement of equivalence relations again using the same $T_h$ and the
recalculated $\gamma$. Then we obtain

$$\begin{aligned}
U/R'_1 &= \{\{x_1, x_2, x_3, x_4\}, \{x_5\}\}, \\
U/R'_2 &= \{\{x_1, x_2, x_3, x_4\}, \{x_5\}\}, \\
U/R'_3 &= \{\{x_1, x_2, x_3, x_4\}, \{x_5\}\}, \\
U/R'_4 &= \{\{x_1, x_2, x_3, x_4\}, \{x_5\}\}, \\
U/R'_5 &= \{\{x_5\}, \{x_1, x_2, x_3, x_4\}\}.
\end{aligned} \tag{19}$$

Hence

$$U/\mathbf{R}' = \{\{x_1, x_2, x_3, x_4\}, \{x_5\}\}. \tag{20}$$

After the second refinement, the number of the equivalence relations in $\mathbf{R}'$ is
reduced from 3 to 2, and the number of categories is also reduced from 4 to 2.
We further update the state of the indiscernibility degrees according to the
equivalence relations after the second refinement. The results are shown in Table 3.
Since no new pairs whose indiscernibility degree exceeds the given threshold
appear, the refinement process may be halted and the stable categories may be
obtained, as in Equation (20).

As shown in this example, refinement of the equivalence relations may change 
the indiscernibility degree of objects. Thus we iterate the refinement process 
using the same Th until the categories become stable. Note that each refinement 
process is performed using the previously ‘refined’ set of equivalence relations. 
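
The iteration can be sketched as follows and checked against the worked example with $T_h = 2/5$. This is an illustrative sketch rather than the authors' code: the names `gamma`, `refine`, `categories` and `cluster` are chosen here, relations are represented by the sets $P_i$ as in Equation (6), and the loop stops when a refinement pass leaves every $P_i$ unchanged, which is one assumed reading of "until the categories become stable". Comparing the categories alone would stop too early in this example, since Equations (7) and (18) coincide.

```python
from fractions import Fraction

def gamma(P, i, j):
    """Indiscernibility degree (cf. Eq. 8); P[k] is the class of x_k under R_k."""
    indis = sum(1 for Pk in P if i in Pk and j in Pk)
    dis = sum(1 for Pk in P if (i in Pk) != (j in Pk))
    return Fraction(indis, indis + dis) if indis + dis else Fraction(0)

def refine(P, objects, th):
    """One refinement pass (cf. Eq. 13): P'_i = {x_j | gamma(x_i, x_j) >= th}."""
    return [{j for j in objects if gamma(P, i, j) >= th} for i in objects]

def categories(P, objects):
    """U/R: two objects share a category iff no relation discerns them."""
    groups = {}
    for x in objects:
        signature = tuple(x in Pk for Pk in P)
        groups.setdefault(signature, set()).add(x)
    return sorted(groups.values(), key=min)

def cluster(P, objects, th, max_iter=100):
    """Repeat refinement with the same threshold until the relations stop changing."""
    for _ in range(max_iter):
        new_P = refine(P, objects, th)
        if new_P == P:
            break
        P = new_P
    return categories(P, objects)

U = [1, 2, 3, 4, 5]
P0 = [{1, 2, 3}, {1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}, {4, 5}]   # Eq. (6)
print(cluster(P0, U, Fraction(2, 5)))   # [{1, 2, 3, 4}, {5}], as in Eq. (20)
```

With `Fraction(3, 5)` as the threshold, the same call returns the three categories of Equation (16), since the second pass changes nothing.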

3 Experimental Results 

We applied the proposed method to some artificial numerical datasets and evalu- 
ated its clustering ability. Note that we used numerical data, but clustered them 
without using any type of geometric measures. 



3.1 Effects of Iterative Refinement 

We first examined the effects of refinement of the initial equivalence relations. A 
two-dimensional numerical dataset was artificially created using Neyman-Scott 
method [2]. The number of seed points was set to 5. Each of the five clusters 
contained approximately 100 objects, and a total of 491 objects were included 
in the data. We evaluated validity of the clustering result based on the following 
measure: 

$$\text{validity } v_R(C) = \min\left( \frac{|X_R \cap C|}{|X_R|},\ \frac{|X_R \cap C|}{|C|} \right),$$

where $X_R$ and $C$ denote the clusters obtained by the proposed method and the
expected clusters, respectively. The threshold value for refinement $T_h$ was set
to 0.2, meaning that if two objects were commonly regarded as indiscernible by 
20% of objects in the data, all the equivalence relations were modified to regard 
them as indiscernible objects. 
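
For a single pair of an obtained cluster and its expected cluster, the validity measure defined above reduces to a few lines. The sketch below is illustrative: the function name `validity` and the set representation are assumptions, and how the per-cluster values are combined into the single figure reported for a dataset is not specified in this section, so it is left out.

```python
def validity(found, expected):
    """min(|X_R & C| / |X_R|, |X_R & C| / |C|) for one obtained/expected pair."""
    overlap = len(found & expected)
    return min(overlap / len(found), overlap / len(expected))

v = validity({1, 2, 3}, {2, 3, 4, 5})   # overlap 2: min(2/3, 2/4)
```

The first ratio penalizes obtained clusters that spill outside the expected one, and the second penalizes low coverage of the expected cluster, so many tiny clusters score poorly even when each is pure.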

Without refinement, the method produced 461 small clusters. Validity of the 
result was 0.011, which was the smallest value assigned to this dataset. This 
was because the small size of the clusters produced very low coverage, namely, 
amount of overlap between the generated clusters and their corresponding ex- 
pected clusters was very small compared with the size of the expected clusters. 

By performing refinement one time, the number of clusters was reduced to 
429, improving validity to 0.013. As the refinement proceeds, the small clusters 
merged as shown in Figures 3 and 4. Validity of the results continued to increase. 
Finally, clusters became stable at the 6th refinement, where 10 clusters were
formed as shown in Figure 4. Validity of the clusters was 0.927. One can observe
that a few small clusters, for example, clusters 5 and 6, were formed between the
large clusters. These objects were classified into independent clusters because of
the competition of the large clusters containing almost the same populations.
Aside from this, the results revealed that the proposed method automatically
produced good clusters that have high correspondence to the original ones.

Fig. 3. Clusters after 4th refinement.

Fig. 4. Clusters after 6th refinement.

3.2 Capability of Handling Relative Proximity 

In order to validate the method’s capability of handling relative proximity, we
performed clustering experiments with another dataset. The data was originally
generated on the two-dimensional Euclidean space, as was the previous dataset;
however, in this case we randomly modified distances between data points in
order to make the induced proximity matrix not fully satisfy the triangular
inequality.

Table 4. Comparison of the clustering results

  Mutation ratio [%]   0       10              20              30              40              50
  AL-AHC               0.990   0.688 ± 0.011   0.670 ± 0.011   0.660 ± 0.011   0.633 ± 0.013   0.633 ± 0.018
  CL-AHC               0.990   0.874 ± 0.076   0.792 ± 0.093   0.760 ± 0.095   0.707 ± 0.098   0.729 ± 0.082
  Our method           0.981   0.980 ± 0.002   0.979 ± 0.003   0.980 ± 0.003   0.977 ± 0.003   0.966 ± 0.040

The dataset was prepared as follows. First, we created a two-dimensional 
data set by using the Neyman-Scott method [2]. The number of seed points 
was set to three, and a total of 310 points were included in the dataset. Next, 
we calculated the Euclidean distances between the data points and constructed 
a 310 X 310 proximity matrix. Then we randomly selected some elements of 
the proximity matrix and mutated them to zero. The ratio of elements to be 
mutated was set to 10%, 20%, 30%, 40%, and 50%. For each of these mutation
ratios, we created 10 proximity matrices in order to include enough randomness.
Consequently, we obtained a total of 50 proximity matrices. 
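
The preparation of a mutated proximity matrix can be sketched as below. This is a hedged illustration only: the function name `mutated_proximity`, the fixed seed, the decision to draw the mutated elements from the off-diagonal entries, and the decision to mutate entries $(i, j)$ and $(j, i)$ independently are assumptions, since the text does not state these details. The toy points stand in for the Neyman-Scott data.

```python
import math
import random

def mutated_proximity(points, ratio, seed=0):
    """Euclidean proximity matrix with a given ratio of entries set to zero."""
    rng = random.Random(seed)
    n = len(points)
    D = [[math.dist(p, q) for q in points] for p in points]
    # Candidate entries: every off-diagonal position (assumption).
    off_diag = [(i, j) for i in range(n) for j in range(n) if i != j]
    for i, j in rng.sample(off_diag, round(ratio * len(off_diag))):
        D[i][j] = 0.0          # mutate the selected element to zero
    return D

pts = [(0, 0), (1, 0), (0, 2), (3, 3), (5, 1)]   # hypothetical data points
D30 = mutated_proximity(pts, 0.30)
zeros = sum(1 for i in range(5) for j in range(5) if i != j and D30[i][j] == 0.0)
```

A zeroed entry between two distant points makes them appear closer to each other than either is to their real neighbours, which is how the triangular inequality ends up violated locally.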

We took each of the proximity matrices as an input and performed clustering
of the dataset. Parameters used in the proposed method were manually determined
to $\sigma = 15.0$ and $T_h = 0.3$. Additionally, we employed average-linkage and
complete-linkage agglomerative hierarchical clustering methods (for short, AL-AHC
and CL-AHC respectively) [3] for the purpose of comparison. Note that we
partly disregarded the original data values and took the mutated proximity matrix
as input of the clustering methods. Therefore, we did not employ clustering
methods that require direct access to the data value.

We evaluated validity of the clustering results using the same measures as 
in the previous case. Table 4 shows the comparison results. The first row of the 
table represents the ratio of mutation. For example, 30 represents 30% of the 
elements in the proximity matrix were mutated to zero. The next three rows 
contain the validity obtained by AL-AHC, CL-AHC and the proposed method, 
respectively. Except for the cases in zero mutation ratio, validity is represented 
in the form of ’mean ± standard deviation’, summarized from the 10 randomly 
mutated proximity matrices. 

Without any mutation, the proximity matrix exactly corresponded to the
one obtained by using the Euclidean distance. Therefore, both AL-AHC and
CL-AHC could produce a high validity of 0.990. The proposed method also produced
a high validity of 0.981. However, when mutation had occurred, the
validity of clusters obtained by AL-AHC and CL-AHC dropped sharply to 0.688
and 0.874, respectively. They kept decreasing moderately following the increase
of mutation. The primary reason for the decrease of the validity is considered
to be as follows. When the distance between two objects was forced to be mutated
into zero, it brought a kind of local warp to the proximity of the objects. Thus
the two objects could become candidates of the first linkage. If the two objects
originally belonged to different clusters, these clusters were merged at
an early stage of the merging process. Since neither AL-AHC nor CL-AHC
allows inversion of the cluster hierarchy, these clusters would never be separated.
Consequently, inappropriately bridged clusters were obtained as shown in
Figures 5 and 6.

On the contrary, the proposed method produced high validity even when the
mutation ratio approached 50%. In this method, the effect of a mutation was very
limited. The two objects concerned would consider themselves indiscernible
objects; however, the majority of other objects never change their classification.
Although the categories obtained by the initial equivalence relations could be
distorted, they could be globally adjusted through iterative refinement of the
equivalence relations. Consequently, good clusters were obtained as shown in
Figure 7. This demonstrates the capability of the method for handling locally
distorted proximity matrices that do not satisfy the triangular inequality.

202 Shoji Hirano and Shusaku Tsumoto

Fig. 5. Clustering results by AL-AHC. Ratio of mutation was 40%. Linkage was terminated
when three clusters were formed.

Fig. 6. Clustering results by CL-AHC. Ratio of mutation was 40%. Linkage was terminated
when three clusters were formed.

Fig. 7. Clustering results by the proposed method. Ratio of mutation was 40%. Iteration
terminated at the fourth cycle.

4 Conclusions 

In this paper, we have presented an indiscernibility-based clustering method, 
which clusters objects according to their relative proximity. Experimental results 
from the artificially created numerical datasets demonstrated that this method
could produce good clusters even when the proximity of the objects did not satisfy
the triangular inequality. Future work includes reduction of the computational
complexity of the method and empirical evaluation of its clustering ability on 
large and complex real-life databases. 

Abstract. Advanced personalized e-applications require comprehensive knowledge
about their user’s likes and dislikes in order to provide individual product
recommendations, personal customer advice and custom-tailored product of- 
fers. In our approach we model such preferences as strict partial orders with “A 
is better than B” semantics, which has been proven to be very suitable in vari- 
ous e-applications. In this paper we present novel Preference Mining techniques 
for detecting strict partial order preferences in user log data. The main advan- 
tage of our approach is the semantic expressiveness of the Preference Mining 
results. Experimental evaluations prove the effectiveness and efficiency of our 
algorithms. Since the Preference Mining implementation uses sophisticated 
SQL statements to execute all data-intensive operations on database layer, our 
algorithms scale well even for large log data sets. With our approach personal- 
ized e-applications can gain valuable knowledge about their customers’ prefer- 
ences, which is essential for a qualified customer service. 



1 Introduction 

The enormous growth of web content and web-based applications leads to an unsatis- 
factory behavior for users: search engines retrieve a huge number of results and they 
are left on their own to find interesting web sites or preferred products. Such a behav- 
ior leads not only to frustrated users but also to a reduction of turnover in commercial 
businesses because customers who are willing to buy cannot do it since they do not 
find the right product even if it is available. In recent years, several techniques have 
been developed to build user adaptive web sites and personalized web applications. 
For instance, e-commerce applications use link personalization to recommend items 
based on the customer’s buying history or some categorization of customers based on 
ratings and opinions [11]. Another technique is content personalization: web pages 
present different information to different users based on their individual needs. There- 
by, the user can indicate his preferences explicitly using the predefined tools of the 
underlying portal or the preferences may be inferred automatically from his profile. 

State-of-the-art personalization techniques suffer from some drawbacks. Manually
customizing web sites is not very feasible for the customer since it is a very time-consuming
task to select relevant content from the huge repertoire provided by the
web portal. Personalizing products or web content automatically is a more promising
approach. However, the current approaches to automatic personalization rely on preference
models with limited expressiveness. State-of-the-art techniques either use
scores to describe preferences [7] or just distinguish between liked and disliked values
[9]. Thus, complex “I like A more than B” relationships as well as preferences
for numeric attributes cannot be expressed in a natural way. Furthermore, these approaches
are not able to handle dependencies among preferences. For example, two
preferences can be of equal importance to a customer or one preference can be
preferred to another one.

A very expressive and mathematically well-founded framework for preferences 
has recently been introduced [5]. Customer preferences are modeled as strict partial 
orders with “A is better than B”-semantics, where negative, numeric and complex 
preferences form special cases. This approach has been proven to be very suitable for 
modeling user preferences of almost any complexity. Standard query languages like 
SQL and XPATH were extended by such preferences [6] in order to deal carefully 
with user wishes. In this paper, we present algorithms for automatically mining such 
strict partial order preferences from user log data. Basic categorical and numerical 
preferences are discovered based on the frequencies of the different attribute values in 
the user log. These base preferences are then combined to detect complex prefer- 
ences. 

The rest of the paper is organized as follows: After a survey of related work in sec- 
tion 2 we describe the underlying preference model and Preference Mining require- 
ments in section 3. In section 4, we present algorithms for mining categorical, nu- 
merical and complex preferences. Section 5 summarizes the results of an extensive 
experimental evaluation of the accuracy and efficiency of the proposed algorithms. 
We conclude our paper with a summary and outlook in section 6. All proofs are omitted
here, but can be found in the extended version of this paper [3].



2 Related Work 

Several research groups have studied the usage of log data analysis for personalized 
applications. In particular, web log mining is a commonly used approach of analyzing 
web log data with data mining techniques for the discovery of sequential patterns, 
association rules or user clusters. Such mining techniques have been applied to pro- 
vide personalized link recommendations to web users [10]. Thereby the user profile 
of the current user is matched against one or more previously discovered usage pro- 
files. 

Lin et al. applied association rule mining algorithms for collaborative recommen- 
dations [9]. They mapped user ratings (liked/disliked/not rated) of articles into trans- 
actions and used their algorithms in order to detect association rules within these 
transactions. The resulting rules can be used to recommend articles to the users.

Beeferman and Berger analyzed query log data of search engines [1]. They developed
clustering algorithms in order to find groups of URLs that match various keywords
given by the user. This approach is not only helpful for delivering better search
results but also for the construction of web categories and the generation of ontolo- 
gies. In [4], Joachims analyzed clickthrough data to improve the results of search 
engines. He uses the search results that are chosen by the user as additional informa- 
tion. He argues that selected items are better in the opinion of the user and applies this 
knowledge to find better rankings for future search results. 

Our Preference Mining techniques can work either on web logs or query logs. The 
main advantage of our approach is the semantic expressiveness of the Preference 
Mining results. Our algorithms compute no scores to distinguish between liked and 
disliked values but detect intuitive preferences like positive or negative preferences, 
numerical preferences or complex preferences. Personalized web applications [11] 
can gain significant improvements by using detailed knowledge about user prefer- 
ences. 



3 User Preferences in Log Data 

In this section we revisit those aspects of the preference model of [5] that are relevant 
for the scope of this paper. We also define requirements on the user log data for min- 
ing such preferences. 



3.1 Preferences as Strict Partial Orders 

A preference P is defined as a strict partial order $P = (A, <_P)$, where
$A = \{A_1, \ldots, A_k\}$ denotes a set of attributes with corresponding domains
$\mathrm{dom}(A_i)$. The domain of A is defined as the Cartesian product of the
$\mathrm{dom}(A_i)$, $<_P \subseteq \mathrm{dom}(A) \times \mathrm{dom}(A)$, and
$x <_P y$ is interpreted as “y is better than x”. A set of intuitive preference
constructors for base and complex preferences is defined.

The constructors for base preferences on categorical domains are POS(A, POS-set),
NEG(A, NEG-set), POS/NEG(A, POS-set; NEG-set), POS/POS(A, POS1-set; POS2-set) and
EXP(A, E-graph). The POS-set $\subseteq$ dom(A) of a POS preference defines a set of
values that are better than all other values of dom(A). Analogously, the
NEG-set of a NEG preference describes disliked values. The POS/NEG preference is
a combination of the previous two preferences, and in a POS/POS preference optimal
values (POS1-set) and alternative values (POS2-set) can be specified. In the E-graph of
an EXPLICIT preference a user can specify arbitrary better-than relationships.

The preference constructors for numerical domains include AROUND(A, z),
BETWEEN(A, [low, up]), LOWEST(A) and HIGHEST(A). In an AROUND preference
the desired value is z, but if it is not available, the values closest
to z are the best alternatives. For a BETWEEN preference the values within [low, up]
are optimal. For LOWEST (HIGHEST) preferences lower (higher) values are better.

Preferences can inductively be combined with complex preference constructors. A
Pareto preference $P = P_1 \otimes P_2$ treats the underlying preferences as equally
important, and a Prioritized preference $P = P_1 \& P_2$ treats $P_1$ as more
important than $P_2$. For instance, P = POS(author, {Douglas Adams, Edgar Wallace}) &
NEG(binder, {paperback}) denotes a POS preference for the authors Douglas Adams and
Edgar Wallace, and a NEG preference for paperbacks, whereby the latter preference is
less important.

This definition of preference constructors has been proven to be appropriate for
describing complex user wishes. Preference engineering examples are shown in [5]. Our
Preference Mining developments should be consistent with this preference model.
Therefore, not only should all base and complex preferences be detectable by the
Preference Miner, but preference properties like preference hierarchies or preference
algebra laws (see [5] for details) should also be valid for the detected preferences.



3.2 Requirements on User Log Data in Web Applications 

Data mining benefits from the availability of a huge amount of data since having 
many records ensures the statistical significance of patterns [8]. Log data of user 
transactions can have several sources like web server log-files or transaction logging 
on an application server. 

Web server logs are generated by the web server when a user is visiting a web site.
Such files can comply with standardized formats like the Common Logfile Format¹.
The log data includes the IP of the client host, the current timestamp and the URL
(uniform resource locator) the user is visiting. Valuable information about a user’s
wishes is stored in the URL, since it contains not only the address but also requested
keywords or preferred product properties the user entered into a web form. For example,
if the user requests the book “The Raven” in the e-shop Barnes & Noble, the logged
URL is http://search.barnesandnoble.com/booksearch/results.asp?WRD=The+Raven.
But web server logs also have some disadvantages, especially for e-commerce
applications [8]. Events like “add to cart” or “change item” are not available in web logs.
Furthermore, a user can deactivate cookies in his browser, so no session information
or user identification is available. Preference Mining on web server logs requires
some data preprocessing. User input like “The Raven” in the above example has to be
extracted from the logged URL and has to be stored in a relational database since our
Preference Mining algorithms work on database relations. Furthermore, user
identification is required to detect preferences for each customer separately.
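
The extraction step mentioned above can be illustrated with a minimal sketch. It assumes
that the search keywords are carried in a query parameter named WRD, as in the example
URL; the function name extract_keywords is hypothetical and not part of the Preference
Miner.

```python
from urllib.parse import urlsplit, parse_qs

def extract_keywords(logged_url, param="WRD"):
    """Return the decoded values of one query parameter of a logged URL."""
    query = urlsplit(logged_url).query
    # parse_qs decodes '+' and percent-escapes and returns a list per parameter.
    return parse_qs(query).get(param, [])

url = "http://search.barnesandnoble.com/booksearch/results.asp?WRD=The+Raven"
print(extract_keywords(url))  # ['The Raven']
```

The extracted values could then be written, together with a user identifier, into a log
relation of the kind the mining algorithms operate on.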

Application server logs can handle user transactions much better [8]. User and
session identification can be accomplished with a login and logout mechanism. Another
advantage is the capability to detect business events like “add to cart” or “buy items”.
For example, an e-commerce application server can record queries, search results,
selected items and bought products for each customer separately. Furthermore,
application server log data can be stored in databases, and therefore huge amounts of log
data can be managed by using database technology. The Preference Mining algorithms
can work directly on these log relations without any data preprocessing. For
instance, analyzing the properties of bought products can lead to preferences about
liked and disliked features, price preferences and dependencies between such
preferences.

¹ http://www.w3.org/Daemon/User/Config/Logging.html

While browsing or shopping in an online environment, a customer typically has
several different types of input fields for interacting with the underlying system. Text
fields allow the input of keywords, and choices allow the selection of static or
dynamic predefined values of an attribute. To describe these different situations we
define the closed world assumption and the difference between static and dynamic
domains.

Definition 1 (Closed world assumption (CWA)) 

The assumption that a customer knows all possible values of an attribute is called 
Closed World Assumption or CWA. If this assumption doesn’t hold, we abbreviate this
by ¬CWA.

Definition 2 (Static and dynamic domains) 

If a domain of attribute values is constant over time, we call it a static domain;
otherwise we call it a dynamic domain.

The CWA is required for the detection of negative preferences since only if the user 
knows all possible values can we assume dislike for values he never selected.
Otherwise (¬CWA), we can’t decide whether he doesn’t know or doesn’t like such values.
For instance, in a book shop the customer knows all possible values for binder (pa- 
perback or hardcover) but doesn’t know all available authors. After submitting a 
search query, a customer gets a set of results and chooses one or more of them as his 
preferred products. Such search results define dynamic domains and can lead to valu- 
able clickthrough data, which can be used to get information about explicit user pref- 
erences since the clicked items of the query result are preferred by the user [4]. 

4 Preference Mining Algorithms 

In this section we present algorithms for mining the strict partial order preferences 
introduced in section 3.1. Our methods work on log relations as described in section 
3.2 and use appropriate data mining and statistical methodologies in order to detect 
the right preference and correct additional information like POS-sets. To detect base 
preferences we use the frequencies of the different values in the log relation. 

Definition 3 (Frequency of a value) 

Let A be an attribute of a log relation R and $x \in \mathrm{dom}(A)$. The number of
entries of x in R(A) is called the frequency of x, or $\mathrm{freq}_A(x)$. If dom(A) is
numerical, $\mathrm{freq}_A([x_1, x_2])$ denotes the number of entries of all values
between $x_1$ and $x_2$ ($x_1 < x_2$).

We have introduced the concept of user-defined preferences $P = (A, <_P)$. The actual
user preferences shall be predicted from the implicit preferences hidden in the user
log data. For that purpose, we define data-driven preferences denoted by
$P_D = (A, <_{P_D})$.

Definition 4 (Data-driven preference)

• For categorical domains dom(A), a data-driven preference $P_D = (A, <_{P_D})$ is
defined as: $x <_{P_D} y$ iff $\mathrm{freq}_A(x) < \mathrm{freq}_A(y)$.

• For numerical domains dom(A), a data-driven preference $P_D = (A, <_{P_D})$ is
defined as: $x <_{P_D} y$ iff $\exists \varepsilon > 0$:
$\mathrm{freq}_A([x-\varepsilon, x+\varepsilon]) < \mathrm{freq}_A([y-\varepsilon, y+\varepsilon])$.

As can easily be shown, data-driven preferences define strict partial orders. Depending
on the design of the log data, values can be products (e.g. search results) or just
product properties like color or price. If the frequency of a value x is zero, a customer
has never selected the corresponding value. If CWA holds, $\mathrm{freq}_A(x) = 0$ means
that a customer doesn’t like the property x because he never selected it although he
knows it. Otherwise, if CWA doesn’t hold, the customer may either not like the property
x or may never have heard of it. The relation $\mathrm{freq}_A(x) < \mathrm{freq}_A(y)$
shows that the corresponding customer has selected y more often than x. In this sense
the relation $x <_{P_D} y$ denotes a preference.

Numeric domains need a slightly different approach to data-driven preferences.
For instance, an attribute A may have the real numbers as domain
(dom(A) = $\mathbb{R}$) and we want to test whether a user has a data-driven LOWEST(A)
preference, i.e. lower values are better and should occur with higher frequencies.
Since $\mathbb{R}$ consists of infinitely many different values, the log relation only
contains some of them and typically each value occurs only a few times in the log
relation. Therefore, we use frequencies of intervals. E.g. for a data-driven LOWEST
preference the relation
$\mathrm{freq}_A([x-\varepsilon, x+\varepsilon]) < \mathrm{freq}_A([y-\varepsilon, y+\varepsilon])$
for y < x must hold for some $\varepsilon$.
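
The frequency-based relations of definitions 3 and 4 can be sketched in a few lines. This
is only an illustration under simplifying assumptions: the log relation is reduced to a
plain Python list of attribute values, the function names are hypothetical, and the
numerical test checks one given interval half-width eps instead of searching for a
suitable one.

```python
from collections import Counter

def interval_freq(log, low, high):
    """Number of log entries with a value in [low, high]."""
    return sum(1 for v in log if low <= v <= high)

def worse_categorical(log, x, y):
    """True iff x is worse than y, i.e. x occurs less often than y in the log."""
    counts = Counter(log)
    return counts[x] < counts[y]

def worse_numeric(log, x, y, eps):
    """True iff x is worse than y for the given interval half-width eps."""
    return (interval_freq(log, x - eps, x + eps)
            < interval_freq(log, y - eps, y + eps))

colors = ["blue"] * 5 + ["red"] * 2
prices = [10, 11, 12, 12, 13, 30, 55]
print(worse_categorical(colors, "red", "blue"))  # True
print(worse_numeric(prices, 30, 12, eps=2))      # True
```

In the price example only one entry lies within 2 of the value 30, whereas five entries
lie within 2 of the value 12, so 12 is better than 30 with respect to this log.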



4.1 Mining Categorical Preferences 

Based on $P_D = (A, <_{P_D})$ we can define data-driven preferences for categorical
domains.

Definition 5 (Data-driven preferences for categorical data)

Let A be a categorical attribute of a log relation R and POS-set, NEG-set, POS1-set,
POS2-set, E $\subseteq$ dom(A).

• There is a data-driven POS preference, iff
$\forall x \in \text{POS-set}, \forall y \notin \text{POS-set}: y <_{P_D} x$.

• There is a data-driven NEG preference, iff
$\forall x \in \text{NEG-set}, \forall y \notin \text{NEG-set}: x <_{P_D} y$.

• There is a data-driven POS/POS preference, iff
$\forall x \in \text{POS1-set}, \forall y \in \text{POS2-set}, \forall z \notin (\text{POS1-set} \cup \text{POS2-set})$:
$y <_{P_D} x$ and $z <_{P_D} y$.

• There is a data-driven POS/NEG preference, iff
$\forall x \in \text{POS-set}, \forall y \in \text{NEG-set}, \forall z \notin (\text{POS-set} \cup \text{NEG-set})$:
$z <_{P_D} x$ and $y <_{P_D} z$.

• Let $<_E$ be a strict partial order on E. A data-driven EXPLICIT preference holds, iff
(1) $\forall x, y \in E$ with $x <_E y$: $x <_{P_D} y$, and
(2) $\forall u \in E, \forall v \notin E$: $v <_{P_D} u$.

For a data-driven POS preference the values in the POS-set must occur more often
than the other values, and in a data-driven NEG preference the other values must
occur more often than the values in the NEG-set. POS/POS and POS/NEG are handled
analogously. A data-driven EXPLICIT preference with underlying E-graph exists, if a
value y occurs more often than any successor x in the E-graph. Values outside the
E-graph occur with the lowest frequencies.

The main task for an algorithm for mining categorical preferences is the detection 
of proper POS-sets, NEG-sets, etc. Consider the following example of frequencies for 
an attribute author (CWA doesn’t hold, the domain is static): 



Table 1. Example of frequencies for an attribute “author”

Author           Frequency
Douglas Adams    50
Edgar Wallace    49
Natalie Angier   2
Agatha Christie  3
John Grisham     2



The set {Douglas Adams} is a correct POS-set for a data-driven POS preference. But
intuitively, the set {Douglas Adams, Edgar Wallace} denotes a more reasonable
POS-set since these two values occurred much more frequently than Natalie Angier,
Agatha Christie and John Grisham. The following algorithm for mining categorical
preferences uses clustering techniques in order to detect such proper sets.

Algorithm 1: Miner for categorical preferences in static domains

INPUT: log relation R, attribute A, dom(A)

(1) Compute for each value $x_i$ its frequency in the log relation, $\mathrm{freq}_A(x_i)$.

(2) Compute a clustering of the $x_i$ with $\mathrm{freq}_A(x_i) \geq 1$ by using a
clustering technique.

(3) Depending on the clustering results we have the following possibilities:

(a) There is only one cluster $C_1$ and CWA holds. Here we have a
NEG(A, $\{x \in \mathrm{dom}(A) \mid \mathrm{freq}_A(x) = 0\}$) preference.

(b) There are two clusters $C_1$ and $C_2$, where
$\forall c_1 \in C_1, \forall c_2 \in C_2: \mathrm{freq}_A(c_2) < \mathrm{freq}_A(c_1)$.

(b1) If ¬CWA, we have a POS(A, $C_1$) preference.

(b2) If CWA, there is a
POS/NEG(A, $C_1$; $\{x \in \mathrm{dom}(A) \mid \mathrm{freq}_A(x) = 0\}$) preference.

(c) There are three clusters $C_1$, $C_2$ and $C_3$, where
$\forall c_1 \in C_1, \forall c_2 \in C_2, \forall c_3 \in C_3$:
$\mathrm{freq}_A(c_3) < \mathrm{freq}_A(c_2) < \mathrm{freq}_A(c_1)$.
Here we have a POS/POS(A, $C_1$; $C_2$) preference.

(d) There are more than three clusters $C_1, \ldots, C_m$, where
$\forall c_1 \in C_1, \forall c_2 \in C_2, \ldots, \forall c_m \in C_m$:
$\mathrm{freq}_A(c_m) < \ldots < \mathrm{freq}_A(c_2) < \mathrm{freq}_A(c_1)$.
Here we have an EXPLICIT preference EXP(A, $<_E$) with
$c_m <_E \ldots <_E c_2 <_E c_1$,
$\forall c_1 \in C_1, \forall c_2 \in C_2, \ldots, \forall c_m \in C_m$.

(e) In all other situations there is no data-driven preference.

OUTPUT: the detected preference or that no preference was found

Complexity: If n denotes the number of tuples in the log relation and k is the number
of different values, the k-means clustering needs $O(k^2)$ [2], leading to the overall
complexity $O(n + k^2)$. Typically, we have $k \ll n$ and with it the complexity $O(n)$.

By using a state-of-the-art clustering technique like k-means (with silhouettes for
determining the right number of clusters, see [12]), this algorithm detects two clusters
$C_1$ = {Douglas Adams, Edgar Wallace} and $C_2$ = {Natalie Angier, Agatha Christie, John
Grisham}, leading to a POS(author, {Douglas Adams, Edgar Wallace}) preference in
the above example. The data-driven preferences constructed in algorithm 1 are correct
since the frequencies meet the requirements stated in definition 5. Data-driven
NEG preferences can only be detected if the user knows all possible values
(CWA).
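
The case distinction of algorithm 1 can be sketched as follows. This is a minimal
illustration, not the implementation of the Preference Miner: instead of k-means with
silhouettes, a simple gap heuristic is swapped in, which starts a new cluster whenever
the frequency drops by more than an assumed factor (ratio). The function names and the
tuple representation of preferences are hypothetical.

```python
from collections import Counter

def cluster_by_gaps(freqs, ratio=2.0):
    """Group values by similar frequency, most frequent cluster first.

    Stand-in for k-means with silhouettes: a new cluster starts whenever the
    frequency drops by more than the (assumed) factor `ratio`.
    """
    clusters, prev = [], None
    for value, f in sorted(freqs.items(), key=lambda kv: -kv[1]):
        if prev is not None and prev <= ratio * f:
            clusters[-1].add(value)
        else:
            clusters.append({value})
        prev = f
    return clusters

def mine_categorical(log, domain, cwa):
    """Map the clustering result to a preference as in steps (3a) to (3e)."""
    counts = Counter(log)
    seen = {x: counts[x] for x in domain if counts[x] >= 1}
    unseen = {x for x in domain if counts[x] == 0}
    clusters = cluster_by_gaps(seen)
    if len(clusters) == 1 and cwa:
        return ("NEG", unseen)
    if len(clusters) == 2:
        return ("POS/NEG", clusters[0], unseen) if cwa else ("POS", clusters[0])
    if len(clusters) == 3:
        return ("POS/POS", clusters[0], clusters[1])
    if len(clusters) > 3:
        return ("EXPLICIT", clusters)  # chain of clusters, best cluster first
    return None

log = (["Douglas Adams"] * 50 + ["Edgar Wallace"] * 49 + ["Natalie Angier"] * 2
       + ["Agatha Christie"] * 3 + ["John Grisham"] * 2)
print(mine_categorical(log, set(log), cwa=False))
```

For the frequencies of Table 1 the heuristic yields the same two clusters as described
above and therefore a POS preference on Douglas Adams and Edgar Wallace.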

In dynamic domains the CWA holds, because the user must know the varying values
for his decisions. By selecting or clicking on one or more of the available values,
the user provides preference knowledge since he prefers the selected items to the
other available values. The following algorithm for mining such EXPLICIT preferences
requires an advanced structure of the log relation. We assume we have the
information (query_id, value, selected) within the log relation, whereby “value”
contains a value available to the user, “selected” ($\in$ {0,1}) denotes whether the
corresponding value was selected or not, and “query_id” specifies which values belong to
one search query. That such log data can be constructed at low cost has been shown
in [4].

Algorithm 2 (Miner for EXPLICIT preferences in dynamic domains)

INPUT: log data in the format (query_id, value, selected)

(1) Compute the k occurring values $(x_1, \ldots, x_k)$ in the log relation. Initialize the
better-than graph with E-graph = $\emptyset$.

(2) FOR (i = 1, ..., k) and FOR (j = i+1, ..., k) DO:

(a) Consider the query_ids whose corresponding values contain $x_i$ and $x_j$.

(b) Compute the number s of query_ids where $x_i$ was selected and $x_j$ wasn’t.

(c) Compute the number t of query_ids where $x_j$ was selected and $x_i$ wasn’t.

(d1) If s > t and there is no path from $x_j$ to $x_i$ in E-graph, set
E-graph = E-graph $\cup$ $(x_j, x_i)$. Otherwise, if a path from $x_j$ to $x_i$ exists,
remove it.

(d2) If s < t and there is no path from $x_i$ to $x_j$ in E-graph, set
E-graph = E-graph $\cup$ $(x_i, x_j)$. Otherwise, if a path from $x_i$ to $x_j$ exists,
remove it.

(d3) If s = t, remove within E-graph all direct and transitive connections from $x_i$ to
$x_j$ and vice versa.

OUTPUT: the detected EXPLICIT preference based on E-graph as better-than graph.

Complexity: In the first step the n entries in the log relation have to be considered
($O(n)$). The two nested FOR-loops have the complexity $O(k^2)$. Within the loops all
the tuples in the log relation need to be analyzed ($O(n)$) and a path between two given
vertices ($O(k^2)$ by using Dijkstra’s algorithm) has to be computed. Thus we have
$O(n + k^2(n + k^2))$ as complexity. A main effort lies in the detection of inconsistencies
in the user’s shopping and browsing behavior. If we assume a consistent user behavior,
we can avoid the path detection. The resulting complexity would be $O(n + k^2 n)$.
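
The pairwise comparison at the core of algorithm 2 can be sketched as follows. The sketch
is a simplified variant under stated assumptions: an edge (worse, better) is stored for
each decided pair, a new edge is skipped if it would close a cycle, and existing
connections are never removed again (so the removal parts of steps (d1) to (d3) are not
covered). All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def has_path(edges, start, goal):
    """Depth-first search along directed (worse, better) edges."""
    stack, visited = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in visited:
            visited.add(node)
            stack.extend(b for (a, b) in edges if a == node)
    return False

def mine_explicit(log):
    """log: iterable of (query_id, value, selected) with selected in {0, 1}."""
    queries = defaultdict(dict)
    for query_id, value, selected in log:
        queries[query_id][value] = bool(selected)
    values = sorted({v for shown in queries.values() for v in shown})
    edges = set()
    for x, y in combinations(values, 2):
        shared = [q for q in queries.values() if x in q and y in q]
        s = sum(1 for q in shared if q[x] and not q[y])
        t = sum(1 for q in shared if q[y] and not q[x])
        if s == t:
            continue  # the user is indifferent between x and y
        worse, better = (y, x) if s > t else (x, y)
        # Adding (worse, better) closes a cycle iff better already leads to worse.
        if not has_path(edges, better, worse):
            edges.add((worse, better))
    return edges

log = [(1, "blue", 1), (1, "red", 0),
       (2, "blue", 1), (2, "green", 0),
       (3, "red", 1), (3, "green", 0)]
print(sorted(mine_explicit(log)))
# [('green', 'blue'), ('green', 'red'), ('red', 'blue')]
```

Because conflicting edges are simply skipped, the result of this variant can depend on
the order in which the pairs are processed.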

For two values $x_i$ and $x_j$ the algorithm computes the query_ids that have both
values in the result set. Now $x_i$ is better than $x_j$ if the user selected it more
often. In step (2d) cycles are removed. Therefore we check if there is a path from
$x_i$ to $x_j$ in E-graph before inserting $(x_i, x_j)$ and vice versa. Cycles can occur
if the browsing or shopping behavior of the user has inconsistencies like
blue $<_P$ red $<_P$ green $<_P$ blue. In such situations the preferences of the
customer are not clear and therefore we leave out such relations. If s = t, the user is
indifferent between $x_i$ and $x_j$ and therefore existing preference relations between
$x_i$ and $x_j$ have to be removed.



4.2 Mining Numerical Preferences

The distribution of numerical log data defines a statistical density function
$\varphi(x)$. Properties of this density function provide information about data-driven
preferences. For instance, if $\varphi(x)$ has a unique maximum at z and the gradient is
positive for x < z and negative for x > z, there is an AROUND preference with
around-value z. This approach is consistent with the definition of numeric data-driven
preferences (def. 4) because an increasing density guarantees
$\mathrm{freq}_A([x-\varepsilon, x+\varepsilon]) < \mathrm{freq}_A([y-\varepsilon, y+\varepsilon])$
for x < y and a decreasing density implies
$\mathrm{freq}_A([x-\varepsilon, x+\varepsilon]) > \mathrm{freq}_A([y-\varepsilon, y+\varepsilon])$.
Thus, in the above example values around z are requested most frequently and the
frequency decreases with increasing distance to z. Since the density is usually unknown,
it has to be estimated using the underlying numerical log data. In our implementation we
use histograms as an easy-to-use and efficient density estimation technique [3].
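
How the shape of a histogram could be mapped to a numerical preference is sketched
below. The actual procedure is described in [3]; the following is only a rough
illustration that assumes equal-width bins, a fixed number of bins and a strict
monotonicity test without any tolerance for noise. All function names are hypothetical.

```python
def histogram(values, bins):
    """Equal-width histogram; returns the bin counts and the bin edges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts, [lo + i * width for i in range(bins + 1)]

def detect_numeric_preference(values, bins=5):
    """Return (constructor, parameter) or None if no clear shape is found."""
    if min(values) == max(values):
        return ("AROUND", values[0])
    counts, edges = histogram(values, bins)
    if len(set(counts)) == 1:
        return None  # flat histogram, no preference visible
    pairs = list(zip(counts, counts[1:]))
    if all(a >= b for a, b in pairs):
        return ("LOWEST", None)   # density decreases: lower values are better
    if all(a <= b for a, b in pairs):
        return ("HIGHEST", None)  # density increases: higher values are better
    peak = counts.index(max(counts))
    rising = all(a <= b for a, b in pairs[:peak])
    falling = all(a >= b for a, b in pairs[peak:])
    if rising and falling and counts.count(counts[peak]) == 1:
        return ("AROUND", (edges[peak] + edges[peak + 1]) / 2)
    return None

prices = [10] + [20] * 3 + [30] * 8 + [40] * 3 + [50]
print(detect_numeric_preference(prices))  # ('AROUND', 30.0)
```

The around-value is taken as the centre of the fullest bin, which is only as precise as
the chosen bin width.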



4.3 Mining Complex Preferences

Definition 6 (Data-driven Prioritized preference)

Let $P_D = (A, <_{P_D})$ and $Q_D = (B, <_{Q_D})$ be two data-driven preferences and
$x = (x_1, x_2)$, $y = (y_1, y_2) \in \mathrm{dom}(A) \times \mathrm{dom}(B)$. A data-driven
Prioritized preference $P_D \& Q_D = (\{A, B\}, <_{P_D \& Q_D})$ is defined as:
$x <_{P_D \& Q_D} y$ iff $x_1 <_{P_D} y_1 \lor (x_1 = y_1 \land x_2 <_{Q_D} y_2)$.

Data-driven Pareto preferences can be handled analogously [3]. In order to detect 
such complex data-driven preferences we need the definition of associate values. 
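
The prioritized composition can be illustrated with a short sketch: a tuple is worse
than another one if it is worse in the more important attribute, or equal there and
worse in the less important attribute. The two base relations below encode the POS and
NEG preferences of the book example directly instead of deriving them from log
frequencies; this simplification and all names are assumptions made for illustration.

```python
def prioritized(less_a, less_b):
    """Compose two strict orders; less_x(u, v) is True iff v is better than u."""
    def less(x, y):
        (x1, x2), (y1, y2) = x, y
        return less_a(x1, y1) or (x1 == y1 and less_b(x2, y2))
    return less

POS_AUTHORS = {"Douglas Adams", "Edgar Wallace"}
NEG_BINDERS = {"paperback"}

def less_author(u, v):
    """POS preference: values in the POS-set are better than all others."""
    return u not in POS_AUTHORS and v in POS_AUTHORS

def less_binder(u, v):
    """NEG preference: values in the NEG-set are worse than all others."""
    return u in NEG_BINDERS and v not in NEG_BINDERS

less = prioritized(less_author, less_binder)
print(less(("John Grisham", "hardcover"), ("Douglas Adams", "paperback")))   # True
print(less(("Douglas Adams", "paperback"), ("Douglas Adams", "hardcover")))  # True
print(less(("Douglas Adams", "hardcover"), ("John Grisham", "hardcover")))   # False
```

The first call shows that the author preference dominates: a paperback by a preferred
author is still better than a hardcover by another author.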

Definition 7 (Associate Values) 

Consider a log relation R(A, B, ...). For $a \in \pi_A(R)$ the associate values in B are
defined as $\mathrm{asv}_{A,B}(a) = \pi^*_B(\sigma_{A=a}(R))$.

Thereby $\pi^*$ denotes the relational projection without removing duplicates.

Algorithm 3 (Miner for Prioritized preferences)

INPUT: log relation R(A, B, ...) and a data-driven preference $P_D$ on A

(1) Compute the set M of maximal values of $P_D$ and for all $a_i \in M$ the set of
associate values $\mathrm{asv}_{A,B}(a_i)$.

(2) If there is the same preference $Q_D$ in all sets $\mathrm{asv}_{A,B}(a_i)$ and $P_D$
does not occur in the associate values of the maxima of $Q_D$, there is a Prioritized
preference $P = P_D \& Q_D$.

(3) Otherwise there is no Prioritized preference.

OUTPUT: the detected Prioritized preference or that no preference was found

Complexity: If n denotes the number of tuples and $k_1$ and $k_2$ the effort for mining
$P_D$ and $Q_D$, respectively, the above algorithm has the complexity
$O(n^2 + nk_1 + nk_2)$, since the associate values in B are computed for at most n values
in A, leading to $O(n^2)$. Furthermore, for at most n sets the existence of the
preference $Q_D$ has to be tested ($O(nk_2)$) and vice versa ($O(nk_1)$).

A data-driven Prioritized preference $P = P_D \& Q_D$ exists if, firstly, there is a
data-driven preference $P_D$ and, secondly, in those tuples which have equal values in A,
there is a data-driven preference $Q_D$ in B. Thereby, we consider only the maximal
values of $P_D$, since users often don’t care about a second-level preference if the
prioritized preference isn’t fulfilled optimally. If $P_D$ also occurs in the maximal
values of $Q_D$, a Pareto preference has been found. Therefore, we have to eliminate this
situation here.

In our previous example the preference $P_D$ = POS(author, {Douglas Adams, Edgar
Wallace}) was detected. If the above algorithm detects $Q_D$ = NEG(binder, {paperback})
(dom(binder) = {hardcover, paperback}) in the associate values of Douglas Adams
and Edgar Wallace and, furthermore, $P_D$ is not detected within the hardcover books, a
Prioritized preference $P = P_D \& Q_D$ is found.
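
The control flow of algorithm 3 can be sketched as follows. To keep the example
self-contained, a deliberately crude base miner (toy_pos_miner) stands in for the miners
of sections 4.1 and 4.2: it reports a POS preference on the most frequent value if that
value occurs at least twice as often as the runner-up. The miner, the threshold and the
representation of the log relation as a list of dictionaries are assumptions made for
illustration only.

```python
from collections import Counter

def toy_pos_miner(values, ratio=2.0):
    """Hypothetical base miner: returns (preference, maximal values) or None."""
    ranked = Counter(values).most_common()
    if len(ranked) < 2 or ranked[0][1] < ratio * ranked[1][1]:
        return None
    top = frozenset([ranked[0][0]])
    return ("POS", top), top

def associate_values(rows, src, value, dst):
    """Column dst of all rows whose column src equals value (duplicates kept)."""
    return [row[dst] for row in rows if row[src] == value]

def mine_prioritized(rows, mine_a=toy_pos_miner, mine_b=toy_pos_miner):
    found_p = mine_a([row["A"] for row in rows])
    if found_p is None:
        return None
    p, p_maxima = found_p
    # Step (2), first part: the same Q must hold for every maximal value of P.
    found_q = [mine_b(associate_values(rows, "A", a, "B")) for a in p_maxima]
    if None in found_q or len({q for q, _ in found_q}) != 1:
        return None
    q, q_maxima = found_q[0]
    # Step (2), second part: P must not reappear among the maxima of Q.
    for b in q_maxima:
        back = mine_a(associate_values(rows, "B", b, "A"))
        if back is not None and back[0] == p:
            return None
    return (p, "&", q)

rows = ([{"A": "Douglas Adams", "B": "hardcover"}] * 6
        + [{"A": "Douglas Adams", "B": "paperback"}] * 2
        + [{"A": "John Grisham", "B": "hardcover"}] * 4)
print(mine_prioritized(rows))
```

In this toy log the author preference is visible in the whole relation but not within
the hardcover books alone, so the sketch reports a Prioritized preference of the author
preference over the binder preference.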

5 Experimental Evaluation 

In this section we present test results and performance measurements of an efficient 
database-driven implementation of a Preference Miner prototype. 

5.1 Preference Mining Test Results 

We performed an analysis of the Preference Mining algorithms on the log data of the
COSIMA application [3]. Over five hundred users queried the COSIMA comparison
shop almost four thousand times. COSIMA offers shopping in the three categories
books, CDs and computer products. The application server records for each query the
timestamp, the shop category, the preferred price interval and, depending on the
category, title and author in the book shop, title and performer in the CD shop and
product name and product group in the computer hardware category. We applied the
Preference Mining algorithms to the COSIMA log data, whereby we analyzed the
log data for each user separately. The Preference Miner detected many POS preferences
and also one POS/POS preference for the shop category. Quite a few LOWEST
preferences for price were detected by analyzing the lower price limit. NEG preferences
and even complex preferences were also detected with the Preference Miner.
Mining preferences works very fast: on a PC with 1.3 GHz and 1.5 GB main memory
the Preference Miner needed less than one second to detect the preferences of a user.

Though these test results on real data show the practical usability of our techniques,
we cannot prove their correctness with them since the customer preferences are a priori
unknown. Therefore we created synthetic log data using simulated users with predefined
preference profiles. We defined 35 profiles, where each profile contains between
two and six preferences, e.g. {P1 = POS(color, {blue}), P2 = LOWEST(price),
P3 = P1 & P2}. In our simulation each user queries the product database between 25
and 50 times. In each query a preference of the considered user is chosen and a product
database is queried with it using Preference SQL [6]. The results are stored in a log
database. Afterwards we use the Preference Mining algorithms to detect preferences
within the log data. A comparison of the detected preference profiles with the predefined
user preferences shows the effectiveness of the Preference Mining algorithms.
To assess the quality of our results we define preference precision and preference
recall.

Definition 8 (Precision and recall for preferences) 

Preference precision and preference recall are defined as

$$\mathrm{precision} = \frac{\text{number of correctly detected preferences of user } i}{\text{number of all detected preferences of user } i}$$

$$\mathrm{recall} = \frac{\text{number of correctly detected preferences of user } i}{\text{number of all preferences of user } i}$$

The algorithms for mining base preferences lead to 60 % precision and 39 % recall
averaged over all users. This means that 60 % of the detected preferences occur
exactly in the predefined preference profiles and 39 % of the predefined preferences
are detected with our algorithms. Since a preference is regarded as correct only if all
underlying information (POS-set, NEG-set, around-value, etc.) is detected correctly,
60 % precision and 39 % recall are very promising. Mining complex preferences
yields 55 % precision and 15 % recall. The poor recall here is caused by dependencies
between preferences: if a base preference of a complex preference is not detected,
it is not possible to detect the complex preference itself. Note that we filled
the log relation with the search results. In real-life applications even better Preference
Mining results can be achieved if the selected results or query information are used.
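
The per-user measures of definition 8 and their average over all users can be computed
as in the following sketch. Preferences are represented here as plain strings that must
match exactly, in line with the strict notion of correctness described above; returning
0.0 for an empty set of detected or predefined preferences is an assumed convention, and
the example profile is invented for illustration.

```python
def precision_recall(detected, predefined):
    """Preference precision and recall for a single user."""
    detected, predefined = set(detected), set(predefined)
    correct = detected & predefined
    precision = len(correct) / len(detected) if detected else 0.0
    recall = len(correct) / len(predefined) if predefined else 0.0
    return precision, recall

def average(pairs):
    """Average (precision, recall) over a non-empty list of per-user pairs."""
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n

predefined = {"POS(color, {blue})", "LOWEST(price)", "P1 & P2"}
detected = {"POS(color, {blue})", "HIGHEST(price)"}
print(precision_recall(detected, predefined))
```

In this invented example one of the two detected preferences is correct and one of the
three predefined preferences is found.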



5.2 Performance Measurements 



In this section we analyze the efficiency of the Preference Miner prototype for large 
data sets. The underlying database system is an Oracle 8i database server on an AMD 
CPU with 1.3 GHz and 1.5 GB main memory. For our tests, we created relations
with 10,000, 20,000, 30,000, 40,000 and 50,000 tuples of synthetic data. Categorical 
attributes contain 20 different categories. Numerical attributes have a data range of 
200 (maximal minus minimal value). For mining complex preferences we assume one 
categorical and one numerical attribute. Fig. 1 reports the average runtimes for detect- 
ing a single preference for the different preference types w.r.t. the number of tuples. 




[Figure: runtime plotted over the number of tuples (10,000 to 50,000), one curve each for
Categorical, EXPLICIT, Numeric, Prioritized and Pareto preferences]



Fig. 1. Runtimes for detecting a single preference for the different preference types

Mining numerical preferences is the fastest task, since histograms can be computed
very efficiently in the database layer. The miner for categorical data needs more effort
since clustering is a more expensive iterative process. Mining Prioritized and
Pareto preferences needs about 5 seconds on average. The most expensive algorithm
is the miner for EXPLICIT preferences (algorithm 2). The cost-intensive part is
the cycle test, which leads to a runtime that depends linearly on the number of
tuples. The efficiency of our Preference Mining algorithms allows their usage for
online Preference Mining: while interacting with a customer, an e-application can
check his preferences online and react flexibly to his wishes during the sales process.

6 Summary and Outlook 

In this paper we have presented a novel approach for mining preferences from user 
log data based on the concept of strict partial order preferences. We presented several 
algorithms for the detection of categorical, numerical and complex preferences. Our 
prototype implementation executes all data-intensive operations on the database 
server and exhibits excellent efficiency. Our experimental results also demonstrate 
promising precision and recall of the detected user preferences. 

Our next steps include the integration of user situations into preferences. Situations 
can be described with a set of parameters like current time and location, the user’s 
role or physical and psychological condition of the user. Some preferences may only 
be relevant under specific situations; for example, in a bookshop a user may have
different preferred categories depending on whether he is at work or at home. A major task is the
adaptation of our Preference Mining algorithms in order to detect situated prefer- 
ences. 

Another research task is the design of an appropriate storage structure for prefer- 
ences. Such a Preference Repository should not only be able to record preferences 
detected with the Preference Miner but also preferences defined with Preference SQL 
or Preference XPATH. The integration of situations should also be possible as well as 
user identifiers to assign users and user groups. Finally, the Preference Repository 
shall also include a set of appropriate access operations for inserting, deleting and 
updating preferences. It can also be used to find users with similar preferences, so
that product recommendations based on preferences can be offered. Therefore the
Preference Repository is also a major step towards advanced personalized applications.



Acknowledgements 

This work is partially supported by the German Research Foundation DFG within the 
research group “Efficient Electronic Coordination in the Service Sector”. 




Stefan Holland, Martin Ester, and Werner Kießling



References 

1. D. Beeferman and A. Berger: Agglomerative Clustering of a Search Engine Query Log. 
Proc. ACM SIGKDD 2000, p. 407-416, Boston, Massachusetts, USA, 2000. 

2. V. Estivill-Castro and M. E. Houle: Robust Distance-Based Clustering with Applications to 
Spatial Data Mining. In 3rd Pacific-Asia Conference on Knowledge Discovery and Data 
Mining, p. 327-337, Beijing, China, 1999. 

3. S. Holland, M. Ester and W. Kießling: Preference Mining: A Novel Approach on Mining 
User Preferences for Personalized Applications. Technical Report 2003-5, Institute of 
Computer Science, University of Augsburg, May 2003. 
http://www.informatik.uni-augsburg.de/nav/forschung. 

4. T. Joachims: Optimizing Search Engines using Clickthrough Data. Proc. ACM SIGKDD 
2002, Edmonton, Alberta, Canada, 2002. 

5. W. Kießling: Foundations of Preferences in Database Systems. Proc. VLDB 2002, 
p. 311-322, Hong Kong, China, 2002. 

6. W. Kießling and G. Köstler: Preference SQL - Design, Implementation, Experiences. Proc. 
VLDB 2002, p. 990-1001, Hong Kong, China, 2002. 

7. S.-J. Ko, J.-H. Lee: User Preference Mining through Collaborative Filtering and Content 
Based Filtering in Recommender System. Proc. of the 3rd Intern. Conf. on E-Commerce 
and Web Technologies (EC-Web 2002), p. 244-253, Aix-en-Provence, France, 2002. 

8. R. Kohavi: Mining E-Commerce Data: The Good, the Bad, and the Ugly. Proc. ACM 
SIGKDD 2001, p. 8-13, San Francisco, California, USA, 2001. 

9. W. Lin, S. A. Alvarez, and C. Ruiz: Efficient Adaptive-Support Association Rule Mining for 
Recommender Systems. In DMKD Journal, vol. 6 (1), p. 83-105, 2002. 

10. B. Mobasher, R. Cooley and J. Srivastava: Automatic Personalization Based on Web Usage 
Mining. In Communications of the ACM, vol. 43 (8), p. 142-151, August, 2000. 

11. G. Rossi, D. Schwabe and R. Guimaraes: Designing Personalized Web Applications. Proc. 
10th World Wide Web Conference (WWW 2001), p. 275-284, Hong Kong, China, 2001. 

12. P. J. Rousseeuw: Silhouettes: A Graphical Aid to the Interpretations and Validation of 
Cluster Analysis. Journal of Computational and Applied Mathematics, 20, 53-65, 1987. 




Explaining Text Clustering Results 
Using Semantic Structures 



Andreas Hotho, Steffen Staab, and Gerd Stumme 



Institute of Applied Informatics and Formal Description Methods AIFB 
University of Karlsruhe, D-76128 Karlsruhe, Germany 
{hotho, staab, stumme}@aifb.uni-karlsruhe.de 
http://www.aifb.uni-karlsruhe.de/WBS 



Abstract. Common text clustering techniques offer rather poor capabilities for 
explaining to their users why a particular result has been achieved. They have the 
disadvantage that they do not relate semantically nearby terms and that they cannot 
explain how resulting clusters are related to each other. In this paper, we discuss 
a way of integrating a large thesaurus and the computation of lattices of resulting 
clusters into common text clustering in order to overcome these two problems. As 
its major result, our approach achieves an explanation using an appropriate level 
of granularity at the concept level as well as an appropriate size and complexity 
of the explaining lattice of resulting clusters. 



1 Introduction 



Clustering is an important task that is performed as part of many text mining and in- 
formation retrieval systems. It can be used for efficiently finding the nearest neighbors 
of a document [1], for improving the precision or recall in information retrieval sys- 
tems [15,11], for aid in browsing a collection of documents [3,8], as well as for the 
organization [19] and personalization of search engine results [13]. 

Most current document clustering approaches are based on the vector-space model 
(also called bag of words model or word space), in which the dimensions of the vector 
space are constituted by the important words of the document collection. The respective term 
or word frequencies (TF) in a given document constitute the vector describing this 
document. In order to discount frequent words with little discriminating power, each 
word can additionally be weighted based on its Inverse Document Frequency (IDF) in 
the document collection. Once the documents are mapped into the vector space, they can 
be clustered according to the distances between the vectors. However, these approaches 
neglect to explain why particular clusters have been formed 
and how the different clusters are related to each other. 

To elaborate on this, we build on a finding by Karypis and Han. They have shown 
[10] that words occurring with high weights in the centroid of a cluster can be used to 
summarize the content of the cluster. They observed that “prevalent terms of the various 
centroids often contain terms that act as synonyms within the context of the topic they 
describe.” Common text clustering algorithms lack the capabilities 

(A) to recognize such synonyms in order to improve text clustering quality; 

(B1) to use such synonymity in order to improve the quality of the explanation of why a 
cluster has been formed (e.g., just state ‘this cluster is about Volkswagen’ instead 
of stating ‘this cluster is about Volkswagen and about VW’); 

(B2) to exploit semantic hierarchies of words in order to abstract an explanation (e.g., 
instead of ‘this cluster is about pork and beef and veal’ state ‘this cluster is about 
meat’); 

(C) to give an account of how resulting clusters are related (e.g. ‘cluster1 is about the 
same topics as cluster2, but additionally about meat’). 

With regard to (A), we have investigated how a large thesaurus like WordNet [5] 
may help to improve text clustering results exploiting synonymity and other semantic 
relationships¹. 

With regard to (B1) and (B2), we use the synonymity of words and the hierarchy of 
their corresponding concepts as defined in a thesaurus² to come up with more concise 
and abstracting explanations. We extract explanations from the centroid representation 
of resulting clusters, but we also use thesaurus concepts instead of words only. 

Finally, with regard to (C), we have explored the use of lattice theory. Formal concept 
analysis (FCA) [6] computes the place of an object representation (e.g. the representa- 
tion of a text or the representation of a text cluster) in a lattice according to its vector 
representation. Unfortunately, formal concept analysis (and similar means of analysis) 
are not suited to directly relate vector representations of large collections of texts. Tests 
with text samples have revealed that even homogeneous text sets have relatively few joint 
word occurrences leading to large and complex lattices with unsatisfying explanatory 
power³. 

In this paper, we address the challenges listed above. Here, we will specifically 
discuss the challenges (B) and (C). Our approach proceeds along the following lines: 

1. It represents text documents by a vector model that exploits the hierarchy of the 
concepts in the WordNet thesaurus (cf. Section 2); 

2. it uses a common text clustering algorithm, BiSec-k-Means, to aggregate texts 
without supervision into a pre-defined number of clusters. Moreover it extracts a 
representation of each resulting cluster (cf. Section 3); 

3. it computes a lattice from the resulting cluster representations in order (i) to relate 
words from the thesaurus hierarchy to the different clusters and (ii) to compare the 
different cluster representations (cf. Section 4); 

4. eventually, it visualizes (parts of) the resulting lattice structure(s) allowing the user 
to explore explanations of how and why clustering results have been produced (cf. 
Section 5). 

¹ A comprehensive empirical investigation on how a thesaurus may improve text clustering is 
reported in [9]. 

² What is typically called a ‘concept’ in ontologies or thesauri is a very close match to what is 
called a ‘synset’ in WordNet. 

³ Result size complexity of FCA may become a problem if the lattice is computed on several 
thousands of texts, as the number of nodes in the lattice may grow exponentially with the 
number of objects. In practical runs, however, we have observed that formal concept analysis 
scaled quite well with regard to the number of texts, much better than the worst case. 

We discuss our approach using the Reuters-21578 text collection. Section 6 provides 
an overview of related work. In particular, we emphasize that while the individual 
algorithms of steps 1 through 4 are well known and to some extent interchangeable with 
similar approaches, the combination we describe here is unique and original, and it 
serves frequently arising objectives in text clustering. 

2 Representing Texts 

This section describes the representation of texts in an exemplary manner drawing from 
the Reuters-21578 dataset on which we performed many of our experiments. The basic 
idea of this section is the extension of the common text representation as a vector in a 
word space towards a representation in a word/concept space. 

The Reuters-21578 Dataset. We selected the Reuters-21578 text collection for our 
experiments. The corpus consists of 21578 documents. This corpus is especially interesting 
for evaluation, as part of it comes along with a (hand-crafted) classification. It contains 
135 so-called topics. To be more general, we will refer to them as ‘classes’ in the 
sequel. To allow evaluation, we restrict ourselves to the 12344 documents which have 
been classified manually by Reuters. Some of them could not be assigned by the experts 
to one of the predefined classes; we collect them in an additional class ‘defnoclass’. 
Reuters assigns some of its documents to multiple classes, but we consider only the first 
assignment. After these steps, we obtain our final corpus $\mathcal{D}$ for evaluation. It consists of 
the 12344 documents, distributed over 82 Reuters topics. 

Preprocessing the Document Set. For the preprocessing of the documents, we used the 
text mining system developed at AIFB within the KAON⁵ framework. We performed 
the following steps on the selected corpus: First we converted all words to lower case 
and removed stopwords. We used a stopword list with 571 entries, which removed 416 
stopwords from the documents. We also dropped all words with fewer than 30 occurrences 
over the whole corpus. 17917 words were removed in total. After these steps, 2657 
different words remained in our list, with a total occurrence of 784434. 
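
The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the KAON text mining system: the whitespace tokenizer, the function name `preprocess`, and the toy stopword set are assumptions made for the example; only the order of the steps and the minimum count of 30 come from the text.

```python
from collections import Counter

def preprocess(docs, stopwords, min_count=30):
    """Lowercase, drop stopwords, then drop words rarer than min_count corpus-wide."""
    # Whitespace tokenization is an assumption of this sketch.
    tokenized = [[w for w in doc.lower().split() if w not in stopwords]
                 for doc in docs]
    # Count occurrences over the whole corpus, as described in the text.
    counts = Counter(w for doc in tokenized for w in doc)
    return [[w for w in doc if counts[w] >= min_count] for doc in tokenized]

# Toy corpus; a small min_count is used so that something survives.
docs = ["Oil prices rise", "The oil tanker left the Gulf", "Prices of palm oil"]
cleaned = preprocess(docs, stopwords={"the", "of"}, min_count=2)
```

With the real corpus one would pass the 571-entry stopword list and keep the default `min_count=30`.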

WordNet as Background Knowledge. Instead of using a bag-of-words model directly, we 
additionally enriched the text representation with background knowledge. The basic 
idea is to replace the words by concepts and their broader concepts as defined in a given 
thesaurus, in order to capture similarities at various levels of generalization. For this 
purpose we needed a resource suitable for the Reuters corpus. We chose WordNet⁶ 
as our background knowledge. WordNet consists of so-called synsets, together with a 
hypernym/hyponym hierarchy⁷. 

To modify the existing word vector representations of text, we first replaced 
all nouns that appeared in the documents and that were known to WordNet by the 
corresponding concept (‘synset’) identifiers from WordNet. At this point, we had several 
choices: (i) how to deal with terms not known to WordNet (delete or keep), (ii) 
how to deal with ambiguity (one word in the document, like ‘bank’, may correspond 
to several concepts in WordNet), and (iii) how many generalizations of a concept to 
use for the text representation. We have elaborated on these choices in [9] 
and present here only a simple, but quite effective combination. 

⁴ http://www.daviddlewis.com/resources/testcollections/reuters21578/ 

⁵ http://kaon.semanticweb.org 

⁶ http://www.cogsci.princeton.edu/~wn/ 

⁷ See http://www.cogsci.princeton.edu/~wn/man1.7.1/wngloss.7WN.html for a glossary. 

In the simplest case, (i) we ignored all words that either were not nouns or 
were not known to WordNet. (ii) We used a disambiguation method provided 
by WordNet: WordNet has a ranking of what is the ‘most common’ meaning for a word 
in English, and we use only this static ranking to map a word onto corresponding 
concepts. (iii) We mapped a word occurrence in a document to its most highly 
ranked concept in WordNet as well as to the four most specific generalizations of this 
concept. For instance, the occurrence of ‘bank’ in a document would increase the vector 
entry corresponding to ‘banking company’ (the concept) as well as the vector entries 
corresponding to ‘financial institution’, ‘institution’, ‘organization’ and ‘social group’ 
as these are the four most specific generalizations of ‘banking company’. The concepts 
that were assigned to at least one document then formed the new set $T$ of terms used for 
describing the documents, i.e., they constitute the dimensions of the vector space for 
the new text representation. 
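
As a sketch of this mapping, the snippet below replaces each known noun by its first-sense concept and adds up to four generalizations. It deliberately uses a tiny hand-written dictionary instead of WordNet itself: `FIRST_SENSE` and `HYPERNYMS` are hypothetical stand-ins, the ‘bank’ chain follows the example in the text, and the ‘insurer’ chain is invented for illustration.

```python
from collections import Counter

# Hypothetical toy thesaurus (NOT WordNet data, except for the 'bank' example
# taken from the text). Hypernym chains are ordered most specific first.
FIRST_SENSE = {"bank": "banking company", "insurer": "insurance company"}
HYPERNYMS = {
    "banking company": ["financial institution", "institution",
                        "organization", "social group", "group"],
    "insurance company": ["financial institution", "institution",
                          "organization", "social group", "group"],
}

def concept_counts(tokens, max_generalizations=4):
    """Count the first-sense concept of each token plus its closest hypernyms."""
    counts = Counter()
    for tok in tokens:
        concept = FIRST_SENSE.get(tok)
        if concept is None:
            continue  # choice (i): ignore words unknown to the thesaurus
        counts[concept] += 1
        # choice (iii): add the four most specific generalizations
        for hyper in HYPERNYMS.get(concept, [])[:max_generalizations]:
            counts[hyper] += 1
    return counts

counts = concept_counts(["bank", "insurer", "xyz"])
```

In this toy run both words contribute to ‘financial institution’, which is the effect the enrichment is meant to achieve.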

Enriching the term vectors with concepts from WordNet has two benefits. First, it 
resolves synonyms; second, it introduces more general concepts which help to identify 
related topics. For instance, a document about ‘bank’ may not be related to a 
document about ‘insurer’ by the cluster algorithm if there are only ‘bank’ and ‘insurer’ 
in the term vector. But if the more general concept ‘financial institution’ is added to both 
documents, their semantic relationship is revealed. 

In the remainder of this paper, we will use the expression ‘term’ both for words and 
for concepts (synsets) of the thesaurus for sake of simplicity. If we talk about one of 
them specifically, we will mention it explicitly. 

Building the Term Vectors. Based on the work done so far, we built a term vector for 
each document $d \in \mathcal{D}$. For each document, the terms $t \in T$ are weighted by tfidf (term 
frequency × inverse document frequency) [16], which is defined as follows: 

$$\mathrm{tfidf}(d, t) = \mathrm{tf}(d, t) \times \log\frac{|\mathcal{D}|}{|\mathcal{D}_t|},$$

where $\mathrm{tf}(d, t)$ is the frequency of term $t$ in document $d$, and $\mathcal{D}_t \subseteq \mathcal{D}$ 
is the set of all documents containing term $t$. The term vector for document $d$ is then the 
tuple $w_d := (\mathrm{tfidf}(d, t))_{t \in T}$. 

Tfidf weighs the frequency of a term in a document with a factor that discounts its 
importance when it appears in almost all documents. Therefore terms that appear too 
rarely or too frequently are ranked lower than terms that hold the balance and, hence, 
are expected to be better able to contribute to clustering results. 
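
The weighting can be sketched in a few lines. The code follows the definition given above (raw term frequency times the logarithm of the inverse document frequency); the sparse dictionary representation, the function name `tfidf_vectors` and the toy documents are assumptions of this sketch.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of term lists. Returns one sparse {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors

docs = [["oil", "oil", "tanker"], ["oil", "palm"], ["oil", "loss"]]
vecs = tfidf_vectors(docs)
```

A term that occurs in every document, such as ‘oil’ in the toy corpus, receives weight zero regardless of its frequency.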

After this first step, we have thus obtained a description of all documents, which is 
enriched by background knowledge, and which will also allow us to relate semantically 
close (but syntactically different) documents. 

3 Text Clustering and Feature Extraction 

In this section, we show how to cluster our, so far uncommon, text representations with 
state-of-the-art methods (cf. [17]). The major output of this section is an explanation of 
the text clusters achieved by known means, like [10], which is then used as an input for 
analysis and explanations in subsequent sections. 

The reader may note that while we present a specific, effective and efficient, approach 
for text clustering and extraction of cluster representations here, this approach might be 
replaced by other methods that digest similar input and produce similar output without 
changing our principal approach. 

3.1 Text Clustering with BiSec-k-Means 

On the preprocessed data we applied BiSec-k-Means [17], a ‘bisecting’ variant of 
k-Means, using the cosine similarity: the similarity between two documents $d_1, d_2 \in \mathcal{D}$ is 
calculated as the cosine of the angle between their term vectors $w_{d_1}$ and $w_{d_2}$. 
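
For sparse term vectors the cosine similarity can be computed as sketched below. Representing a vector as a `{term: weight}` dictionary and returning 0 for an all-zero vector are choices made for this illustration only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention of this sketch for empty vectors
    return dot / (norm_u * norm_v)

same_direction = cosine({"oil": 1.0}, {"oil": 2.0})
no_overlap = cosine({"oil": 1.0}, {"loss": 1.0})
```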

For this clustering step we need a fast algorithm (such as k-Means) that is able 
to deal with large datasets and also provides reasonable accuracy. Instead of 
a slow agglomerative clustering technique with good accuracy, we chose BiSec-k-Means, 
which tends to give better results than k-Means and is sometimes also better than 
agglomerative clustering, while it is as fast as k-Means (cf. the seminal paper [17]). 

BiSec-k-Means is based on the k-Means algorithm. It repeatedly splits the largest 
cluster (using k-Means) until the desired number of clusters is obtained. As input, it 
takes the list $(w_d)_{d \in \mathcal{D}}$ of document descriptions and the number $k$ of desired clusters. 
As output, it provides a partitioning $\mathcal{C}$ of the set $\mathcal{D}$ of documents (i.e., a set $\mathcal{C}$ of $k$ 
disjoint subsets of $\mathcal{D}$ with $\bigcup_{C \in \mathcal{C}} C = \mathcal{D}$). Each cluster $C \in \mathcal{C}$ is represented by its 
centroid $w_C$. 
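
The splitting loop can be sketched as follows. This is a simplified illustration of the idea (split the largest cluster with 2-Means until k clusters exist), not the implementation from [17]: the deterministic seeding, the fixed iteration cap, the dense list representation and all function names are assumptions of the sketch.

```python
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def two_means(vectors, iters=20):
    """Split one cluster into two parts using cosine similarity."""
    # Assumed seeding: the first vector and the vector least similar to it.
    seeds = [vectors[0], min(vectors, key=lambda v: cos(vectors[0], v))]
    parts = ([], [])
    for _ in range(iters):
        parts = ([], [])
        for v in vectors:
            nearer = 0 if cos(v, seeds[0]) >= cos(v, seeds[1]) else 1
            parts[nearer].append(v)
        if not parts[0] or not parts[1]:
            break
        new_seeds = [centroid(parts[0]), centroid(parts[1])]
        if new_seeds == seeds:  # assignments are stable
            break
        seeds = new_seeds
    return parts

def bisecting_kmeans(vectors, k):
    clusters = [list(vectors)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()        # always split the largest cluster
        left, right = two_means(largest)
        if not left or not right:       # cluster cannot be split further
            clusters.append(largest)
            break
        clusters.extend([left, right])
    return clusters

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = bisecting_kmeans(docs, k=2)
```

The centroid of each resulting cluster is then simply `centroid(cluster)`.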

3.2 Extracting Cluster Descriptions 

For a good explanation of results, it is necessary to detect important terms that are concise 
about the explanation created. The basic idea of mechanisms like latent semantic indexing 
[4] or concept indexing [10] is that the ‘importance’ of a component can be derived from 
the weight it receives by an analysis (be it singular value decomposition or k-means 
clustering, respectively). Correspondingly, we here rank the importance of terms for 
explaining clustering results based on the weights they have in the cluster centroids. 

In order to be able to control how many terms remain to describe the clusters and, 
hence, be concise, we discretize the term ranks into three descriptions ‘very important’, 
‘important’, or ‘uninteresting’ by two thresholds $\theta_1, \theta_2$. In our running example, we set 
$\theta_1$ to 7% and $\theta_2$ to 20% of the maximal value. We can then explain the clustering results 
by considering the terms that are at least ‘important’ for a resulting cluster. 
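
A possible reading of this discretization is sketched below. Two points are assumptions of the sketch rather than statements from the text: the ‘maximal value’ is taken to be the largest weight within the centroid at hand, and a weight equal to a threshold counts as reaching it.

```python
def describe_cluster(centroid, theta1=0.07, theta2=0.20):
    """Label each term of a centroid ({term: weight}) by importance."""
    # Assumption: thresholds are fractions of the largest weight in this centroid.
    top = max(centroid.values())
    labels = {}
    for term, weight in centroid.items():
        if weight >= theta2 * top:
            labels[term] = "very important"
        elif weight >= theta1 * top:
            labels[term] = "important"
        else:
            labels[term] = "uninteresting"
    return labels

# Toy centroid; the weights are illustrative only.
labels = describe_cluster({"loss": 0.34, "failure": 0.05, "number": 0.01})
```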

3.3 Examining BiSec-k-Means Explanations on the Reuters-21578 Dataset 

Table 1 shows the highest ranked terms from the centroids of ten (out of 100) resulting 
clusters on the Reuters-21578 dataset together with their value in the respective centroid. 
All listed values are above the lower threshold $\theta_1 = 7\%$ (i.e., they are at least ‘important’). 
In general the set of terms that exceed the threshold is much larger (e.g., up to 50 terms 
of a cluster centroid exceed $\theta_1$) than the set that can be listed here. 

A general overview of these results reveals that it is hard to understand the results. 
While some part of the difficulty stems from the simple, tabular way in which it is 
presented to the user, quite a substantial part of the difficulty comes from the sheer fact 
that there are only few meaningful structures that can be represented to the user at all. To 
substantiate this claim, let us investigate in detail what kind of structures are available 
for subsequent explanation in a visualization tool and which are not. 

Table 1. The highest ranked terms in the first ten out of 100 clusters resulting from a BiSec-k- 
Means run on the Reuters-21578 dataset (ordered by their values in the respective centroids). 

Cluster 0 
  amount                              0.12 
  billion, one million million        0.11 
  large integer                       0.11 
  integer, whole number               0.11 
  insufficiency, inadequacy           0.1 
  deficit, shortage, shortfal...      0.1 
  number                              0.09 
  excess, surplus, surplusa...        0.09 
  overabundance, overmuc...           0.09 
  abundance, copiousness              0.09 

Cluster 1 
  depository financial instit...      0.09 
  financial institution, finan...     0.09 
  rate, charge per unit               0.09 
  charge                              0.09 
  institution, establishment          0.09 
  loss                                0.08 
  monetary unit                       0.07 
  central, telephone excha...         0.07 
  financial loss                      0.06 
  outgo, expenditure, outla...        0.06 

Cluster 2 
  loss                                0.34 
  failure                             0.33 
  nonaccomplishment, non...           0.32 
  Connecticut, Nutmeg Sta...          0.28 
  ten, 10, X, tenner, decad...        0.24 
  American state                      0.23 
  state, province                     0.22 
  system, unit                        0.19 
  network, net, mesh, mes...          0.19 
  September, Sep, Sept                0.18 

Cluster 3 
  Irani, Iranian, Persian             0.14 
  Iran, Islamic Republic of...        0.13 
  gulf                                0.13 
  vessel, watercraft                  0.12 
  ship                                0.12 
  craft                               0.12 
  Asian, Asiatic                      0.11 
  person of color, person o...        0.10 
  Asian country, Asian nat...         0.10 
  oil tanker, oiler, tanker, ...      0.10 

Cluster 4 
  indebtedness, liability, fin...     0.12 
  obligation                          0.12 
  debt                                0.12 
  written agreement                   0.1 
  agreement, understandin...          0.08 
  creditor                            0.08 
  lender, loaner                      0.08 
  statement                           0.07 
  billion, one million million        0.06 
  large integer                       0.05 

Cluster 5 
  text, textual matter                0.15 
  matter                              0.15 
  letter, missive                     0.15 
  sign, mark                          0.13 
  clue, clew, cue                     0.13 
  purpose, intent, intention          0.11 
  evidence                            0.11 
  indication, indicant                0.11 
  goal, end                           0.1 
  writing, written material, c...     0.07 

Cluster 6 
  loss                                0.34 
  failure                             0.33 
  nonaccomplishment, non...           0.32 
  common fraction, simple...          0.22 
  fraction                            0.22 
  rational number                     0.22 
  real number, real                   0.22 
  complex number, comple...           0.22 
  one-half, half                      0.22 
  revolutions per minute, ...         0.22 

Cluster 7 
  gross sales, gross reven...         0.11 
  sum, sum of money, amo...           0.09 
  income                              0.09 
  financial gain                      0.09 
  gain                                0.09 
  enterprise                          0.05 
  business, concern, busin...         0.05 
  assets                              0.05 
  division                            0.05 
  army unit                           0.05 

Cluster 8 
  tender, legal tender                0.15 
  offer, offering                     0.14 
  medium of exchange, mo...           0.11 
  speech act                          0.1 
  indicator                           0.1 
  standard, criterion, meas...        0.1 
  reference point, point of r...      0.09 
  signal, signaling, sign             0.08 
  acquisition                         0.06 
  giant                               0.06 

Cluster 9 
  metric weight unit, weigh...        0.15 
  metric ton, MT, tonne, t            0.15 
  mass unit                           0.14 
  palm, thenar                        0.14 
  area, region                        0.12 
  unit of measurement, uni...         0.10 
  organic compound                    0.10 
  oil                                 0.10 
  lipid, lipide, lipoid               0.10 
  compound, chemical com...           0.08 

E.g., one may recognize from Table 1 that clusters 2 and 6 are similar because they 
both are about ‘loss’, ‘failure’ and ‘non-accomplishment’. Also, the human observer 
may interpret a cluster description like the one of cluster 1, in order to guess that the 
list ‘depository financial institution’, ‘financial institution’, ‘rate’, ‘charge’, ‘institution’, 
‘loss’, ‘monetary unit’, ‘financial loss’, ‘expenditure’ probably means that this cluster 
is about financial transaction with loss (which is also correct when investigating the 
corresponding Reuters news documents). While these structures are not that easy to find 
in the tables, it is not hard to imagine a user interface to facilitate their discovery. 

However, there are meaningful structures that are more difficult to find. For instance, 
the occurrence of ‘oil’ relates Cluster 3 (ranked further down in the list) with Cluster 
9 and several other clusters (from the set of clusters 10 to 99). Along similar lines, it 
would be nice to see how switching from a general concept like ‘chemical compound’ 
to a more specific one like ‘oil’ switches the set of associated clusters. Eventually, one 
would like to find that a particular type of oil or a term like ‘palm’ is a unique property 
of Cluster 9 as compared against all other clusters. Such structural dependencies require 
a further analysis as we propose in the following sections. 

Finally, we summarize the problems encountered when extracting explanatory 
terms from cluster centroids: This model, if used on its own, assumes that the ranking 
of terms adequately reflects the importance of terms, which is often not the case (e.g., 
for Cluster 6 it remains unclear what type of ‘loss’ is encountered). In fact, importance 
frequently depends on what terms help to explain commonalities or differences between 
clusters, an analysis provided by the next step. 

4 Computing the Lattice of Cluster Representations 

The clusters obtained by the previous step have the advantage that they cluster similar 
documents. However, the clustering does not give a description of how the clusters are 
related to each other, i.e. an explicit account of what their commonalities and differences 
are. Formal Concept Analysis derives a lattice that incorporates this account. 

4.1 Formal Concept Analysis 

Formal Concept Analysis (FCA) was introduced for modeling the concept ‘concept’⁸ in 
terms of lattice theory. We recall the basics of FCA as far as needed for this paper. An 
extensive overview is given in [6]. To allow a mathematical description of concepts as 
being composed of extensions and intensions, FCA starts with a formal context. 

Definition: A formal context is a triple $\mathbb{K} := (G, M, I)$, where $G$ is a set of objects, $M$ 
is a set of attributes, and $I$ is a binary relation between $G$ and $M$ (i.e. $I \subseteq G \times M$). 
$(g, m) \in I$ is read “object $g$ has attribute $m$”. 

A straightforward way of modeling our problem in FCA would be to let the set of 
objects consist of all clusters determined in the previous step, i.e., $G := \mathcal{C}$, and let the set 
of attributes consist of all terms which remain from the step described in Section 3.3, i.e., 
$M := T_{\mathcal{C}}$. In order to obtain a more fine-grained view, we additionally apply conceptual 
scaling. We use the two thresholds of our example and impose an ordinal scale on the 
object set with two thresholds $\theta_1$ and $\theta_2$. The formal context $(G, M, I)$ is then composed 
as follows: $G := \mathcal{C} \times \{\theta_1, \theta_2\}$, $M := T_{\mathcal{C}}$, and $((C, \theta_i), t) \in I :\Leftrightarrow (w_C)_t \geq \theta_i$. The 
relation $I$, applied to a pair $(C, \theta_i)$, thus returns the set $\{(C, \theta_i)\}'$ of all attributes which 
are more or less (i.e., with threshold $\theta_i$) relevant for cluster $C$. From a formal context, 
a concept hierarchy, called concept lattice, can then be derived: 

Definition: For $A \subseteq G$, we define $A' := \{m \in M \mid \forall g \in A : (g, m) \in I\}$ and, for 
$B \subseteq M$, we define $B' := \{g \in G \mid \forall m \in B : (g, m) \in I\}$. 

A formal concept of a formal context $(G, M, I)$ is defined as a pair $(A, B)$ with 
$A \subseteq G$, $B \subseteq M$, $A' = B$ and $B' = A$. The sets $A$ and $B$ are called the extent and the 
intent of the formal concept $(A, B)$. The subconcept-superconcept relation is formalized 
by 

$$(A_1, B_1) \leq (A_2, B_2) \;:\Longleftrightarrow\; A_1 \subseteq A_2 \quad (\Longleftrightarrow B_1 \supseteq B_2).$$

The set of all formal concepts of a context $\mathbb{K}$ together with the partial order $\leq$ is always 
a complete lattice⁹, called the concept lattice of $\mathbb{K}$ and denoted by $\mathfrak{B}(\mathbb{K})$. 
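
The two derivation operators and the definition of a formal concept translate almost literally into code. The sketch below enumerates all formal concepts of a small context by closing every subset of objects; this brute-force strategy is exponential and is chosen here only for readability, it is not the algorithm used by the authors. The toy context is hypothetical and only loosely modelled on the oil example of this paper.

```python
from itertools import combinations

def derive_attrs(objects, context, all_attrs):
    """A': the attributes shared by every object in `objects`."""
    result = set(all_attrs)
    for g in objects:
        result &= context[g]
    return frozenset(result)

def derive_objs(attrs, context):
    """B': the objects that have every attribute in `attrs`."""
    return frozenset(g for g, has in context.items() if attrs <= has)

def formal_concepts(context):
    """Return all (extent, intent) pairs of a context {object: set of attributes}."""
    all_attrs = set().union(*context.values())
    objs = list(context)
    concepts = set()
    for r in range(len(objs) + 1):
        for subset in combinations(objs, r):
            intent = derive_attrs(subset, context, all_attrs)  # A'
            extent = derive_objs(intent, context)              # A''
            concepts.add((extent, intent))
    return concepts

# Hypothetical toy context (cluster -> important terms).
context = {
    "CL3": {"compound", "oil", "oil tanker"},
    "CL9": {"compound", "oil", "palm"},
    "CL39": {"compound"},
}
concepts = formal_concepts(context)
```

Every pair produced this way satisfies the two conditions of the definition, since the extent is obtained by deriving the intent and the intent is closed by construction.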

The resulting concept lattice can also be interpreted as a concept hierarchy directly on 
the documents, as it is isomorphic to the concept lattice of the context $\mathbb{K}' := (G', M', I')$ 
with $G' := \mathcal{D}$, $M' := T_{\mathcal{C}}$, and $(d, t) \in I'$ iff $d \in C$ and $(w_C)_t$ is at or above the threshold for some cluster 
$C \in \mathcal{C}$. This context is an approximation of the descriptions of the documents by term 
vectors, with the property that all documents in one cluster obtain exactly the same 
description. 
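
The scaled context of cluster representations described in this section can be built as sketched below. The labels ‘m’ and ‘h’ for the lower and the higher threshold follow the notation CL3(m) used later in the paper; interpreting each threshold as a fraction of the largest weight in the respective centroid, and treating equality as reaching the threshold, are assumptions of this sketch.

```python
def scaled_context(centroids, thresholds=None):
    """Map (cluster, level) objects to the set of terms reaching that level.

    centroids: {cluster name: {term: weight}}
    """
    if thresholds is None:
        thresholds = {"m": 0.07, "h": 0.20}  # values of the running example
    context = {}
    for cluster, weights in centroids.items():
        top = max(weights.values())  # assumption: threshold relative to this maximum
        for level, theta in thresholds.items():
            context[(cluster, level)] = {
                t for t, w in weights.items() if w >= theta * top
            }
    return context

# Toy centroid; the weights are illustrative only.
ctx = scaled_context({"CL9": {"palm": 0.14, "oil": 0.02, "area": 0.005}})
```

Because the scale is ordinal, the attribute set of a cluster at the higher level is always contained in its attribute set at the lower level.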

Clustering the objects before applying FCA is an abstraction that might be considered 
a loss of information. However, it is predominantly beneficial for the following reasons. 
Firstly, it reduces the number of objects such that FCA becomes more efficient. Secondly, 
the technique is robust with regard to new documents: A new document is first 
assigned to the cluster with the closest centroid, and then finds its place within the concept 
lattice. If, on the contrary, the document were used directly for computing the 
concept lattice, it could not be guaranteed that the structure of the lattice would not 
change. Finally and most importantly, formal concept analysis applied directly on all 
documents suffers from the low co-occurrence of terms. The application of FCA on the 
Reuters-21578 dataset has shown that hardly any two texts are placed into a common 
node of the lattice. Thus, the lattice became large, unwieldy and hard to understand for 
the human user. Therefore, in our approach we first cluster a large number of texts (e.g. 
$10^4$) into a more manageable number of clusters (e.g. $10^2$). Only then do we compute the 
lattice, allowing for abstraction from some randomness in the joint occurrences of terms. 

⁸ ‘concept’ in FCA is a different notion than ‘concept’ in a thesaurus or ontology. 

⁹ I.e., for each set of formal concepts, there exists always a unique greatest common subconcept 
and a unique least common superconcept. 
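
Assigning a new document to the cluster with the closest centroid can be sketched as follows, here using cosine similarity as the notion of closeness, in line with Section 3.1. The sparse dictionary representation, the function names and the toy centroids are assumptions of this illustration.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def assign_to_cluster(doc_vector, centroids):
    """Return the name of the cluster whose centroid is most similar to the document."""
    return max(centroids, key=lambda name: cosine(doc_vector, centroids[name]))

# Hypothetical centroids for two clusters.
centroids = {
    "CL2": {"loss": 0.34, "failure": 0.33},
    "CL3": {"oil": 0.10, "oil tanker": 0.10, "ship": 0.12},
}
assigned = assign_to_cluster({"oil": 1.0, "ship": 0.5}, centroids)
```

The new document then inherits the position of its cluster in the lattice, so the lattice itself does not have to be recomputed.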



4.2 The Lattice of the Reuters Clusters 

In the Reuters setting, we obtain from the representation computed in the previous 
section a list of over one hundred formal concepts. Each of them groups together clusters of 
the previous steps. This grouping indicates the conceptual similarity of the clusters. 
E.g., we obtain a formal concept, which we here refer to by (*), that has {CL3(m), 
CL9(m), CL23(m), CL79(m), CL85(m), CL95(m)} as extent, and {‘organic compound’, 
‘oil’, ‘lipid, lipide, lipoid’, ‘compound, chemical compound’} as intent¹⁰. This 
formal concept indicates the commonalities (‘conceptual similarity’ when calling it by 
FCA terminology) of these clusters: the majority of documents within these clusters are 
about oil. 

The formal concept (*) has three direct subconcepts: the first has {CL3(m)} as 
extent, and the attributes from above plus some attributes like ‘oil tanker’ and ‘Iranian’ 
as intent. The second has {CL9(m)} as extent, and the attributes from above plus 
some attributes like ‘area’, ‘palm’, and ‘metric ton’ as intent. The third subconcept has 
{CL23(m), CL79(m), CL85(m), CL95(m)} as extent, and the attributes from above 
plus ‘substance, matter’ as intent. These three subconcepts of (*) show what distinguishes 
the clusters grouped together in the formal concept (*). The majority of documents in 
Cluster 3 are about transport of oil (from Iran), those in Cluster 9 about (packaging of) 
palm oil, and those in the remaining clusters about crude oil. 

This example shows that the lattice computed on the resulting clusters can in fact 
provide meaningful explanations about the commonalities and differences of the set of 
clusters — beyond what could be provided in Section 3.3. Since it remains inconvenient 
to figure out these structures just from the list of formal concepts, we furthermore exploit 
techniques for visualizing the computed lattice in the next step. 



5 Visualizing the Concept Lattice 

We make use of Hasse diagrams for visualizing the concept lattice. They follow the 
conventions for the visualization of hierarchical concept systems as established in the 
international standard ISO 704. Figure 1 highlights a part of the concept lattice of our 
context by a Hasse diagram. It will be explained in detail below. The lattice was computed 
and visualized using the Cernato software of NaviCon GmbH¹¹. It shows all clusters 
where the value of the synset ‘compound, chemical compound’ in the centroid is above 
the threshold $\theta_1 = 7\%$. 

¹⁰ (m) stands here for the important and (h) for the very important terms with the higher threshold. 

Fig. 1. The resulting conceptual clustering of the text clusters (visualized for the clusters related 
to chemical compounds). 

In a Hasse diagram, each node represents a formal concept. For technical reasons, 
we reverse the usual reading order: A concept $c_1 \in \mathfrak{B}(\mathbb{K})$ is a subconcept of a concept 
$c_2 \in \mathfrak{B}(\mathbb{K})$ if and only if there is a path of descending(!) edges from the node representing 
$c_1$ to the node representing $c_2$. 

The name of an object $g$ is always attached to the node representing the most specific 
concept (i.e., the smallest concept with respect to $\leq$) with $g$ in its extent (i.e., in our 
figure, the highest such node); dually, the name of an attribute $m$ is always attached to 
the node representing the most general concept with $m$ in its intent (i.e., the lowest such 
node in the diagram). We can always read the context relation from the diagram, since 
an object $g$ has an attribute $m$ if and only if the concept labeled by $g$ is a subconcept 
of the one labeled by $m$. The extent of a concept consists of all objects whose labels 
are attached to subconcepts, and, dually, the intent consists of all attributes attached to 
superconcepts. 
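
The edges of such a diagram are the pairs of concepts that are direct neighbours in the subconcept order. A minimal sketch of how they could be derived from a list of formal concepts is given below; the quadratic-times-linear search and the toy concepts are illustrative assumptions, not the procedure of the Cernato tool.

```python
def hasse_edges(concepts):
    """Return (subconcept, superconcept) pairs with no concept strictly in between.

    concepts: iterable of (extent, intent) pairs of frozensets.
    """
    concepts = list(concepts)

    def below(c1, c2):
        # Strict subconcept relation: the extent is a proper subset.
        return c1[0] < c2[0]

    edges = []
    for c1 in concepts:
        for c2 in concepts:
            if below(c1, c2) and not any(
                below(c1, c3) and below(c3, c2) for c3 in concepts
            ):
                edges.append((c1, c2))
    return edges

# Hypothetical toy lattice loosely modelled on the oil example.
top = (frozenset({"CL3", "CL9", "CL39"}), frozenset({"compound"}))
star = (frozenset({"CL3", "CL9"}), frozenset({"compound", "oil"}))
c3 = (frozenset({"CL3"}), frozenset({"compound", "oil", "oil tanker"}))
c9 = (frozenset({"CL9"}), frozenset({"compound", "oil", "palm"}))
bottom = (frozenset(), frozenset({"compound", "oil", "oil tanker", "palm"}))
edges = hasse_edges([top, star, c3, c9, bottom])
```

In the toy lattice the concept `star` has exactly two direct subconcepts, `c3` and `c9`, mirroring how the direct subconcepts of (*) are read off the diagram.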

For example, the concept in the lower middle of the diagram labeled by ‘oil’ is the 
concept (*) that we encountered above. In the diagram, we can see that it is part of a 
chain of concepts with increasing specificity. The most general of them (besides the top

** http://www.navicon.de

concept) contains in its extent clusters of documents addressing chemical compounds
(with a medium occurrence): Clusters 3, 9, 23, 39, 79, 85, and 95. The next concept is
the concept (*). Its extent is restricted to those clusters related to oil: all clusters from
above except Cluster 39. We already discussed the subconcepts of (*) above. In the
diagram one can see that there are in fact exactly three subconcepts. The one in
which crude oil is considered, i.e., the one containing Clusters 23, 79, 85, and 95
in its extent, branches again: while no more information is available about Cluster 79,
the documents in Clusters 23 and 95 are about transport, and those in Cluster 95 are additionally
about the oil quotas of the OPEC organization.

Let us also analyze possible problems our approach may encounter: Cluster 85 is also
about oil, but the intent of the concept is labeled by ‘gas’. A closer look at the concepts
of the label reveals an additional topic in Cluster 85, namely ‘gas’ as a state of matter,
which is the first sense in WordNet. An inspection of the documents reveals a mistake
of our disambiguation strategy: in the actual documents, gas was used as a synonym of
gasoline and not as a state of matter. Additionally, some important words that would
better explain the content are missing from our cluster description. An important one is
‘refinement’, whose weight is marginally below the threshold. Thus, we here miss
the explanation that Cluster 85 is about the refinement of crude oil to gasoline.

To conclude, in the visualization of the concept lattice, we are able to navigate the
structures explaining commonalities and differences between different clusters, as
manifested in the lattice computed by FCA. The lattice extends the set of meaningful,
explanatory structures by means that relate clusters to each other and that exploit the
hierarchy defined in the thesaurus for this purpose as a side effect. On the other hand,
without BiSec-k-Means as a preceding step, the FCA step would not have produced
a lattice of reasonable and understandable size, because individual texts are too volatile
with regard to the joint occurrence of relevant terms. As shown, our approach thus
combines the thorough analysis of FCA with the reduction of term and document space
to a concise, but relevant basis.

6 Related Work 

As just summarized, the originality of our approach is not so much based on the individual
algorithms used, as vector representations, BiSec-k-Means, Formal Concept Analysis
and Hasse diagrams are all well known, but on their original integration. This integration
serves the purpose of achieving a careful balance with regard to the granularity
of information used for explanation in three dimensions. First, it automatically finds
the adequate level of generalization of concepts in the thesaurus (e.g., ‘financial institution’
instead of ‘bank’, whereby only the latter actually appears in the texts). Second, it
restricts the term space to a subspace. Thereby, the major components in the cluster centroids
are terms that are particularly able to group and discriminate larger text subsets.
Third, our approach restricts the document space to a subspace. The subspace abstracts
from outlying non-occurrences of individual terms (e.g., one document being about ‘financial
losses in business acquisitions’, but only exhibiting the ‘loss’ information and
not ‘company B burned money’).

Our experiments have shown that, in order to come up with a concise but elaborate
description, one must carefully balance these dimensions before computing a
lattice-based, hence expressive, explanation. Against this background we may compare
paradigms related to our approach.

Hierarchical text clustering, by agglomerative or by partitional algorithms, may be
used to derive a tree of cluster representations. One document is in general not only
found in one cluster, but it is assigned to a hierarchy of clusters. The hierarchy with
its clusters at different levels of representations may be used to describe how clusters
are similar or different. Unlike FCA, however, the tree-like hierarchy does not allow
multiple assignment of categories, as is common for text documents, e.g. Reuters news.

We have explored explanations produced by applying common rule/decision tree 
learners like Ripper [2] or C4.5 [14] on resulting clusters. While the result that these 
algorithms produce may be very good to classify into the categories they learn, they tend 
to produce a larger number of rules to explain a single resulting cluster. The explanation 
for 100 clusters appears to be rather unmanageable for a human user. 

We have mentioned that conceptual clustering techniques might be applied directly 
on the text representations instead of on the text cluster representations. We have also 
mentioned that there arise problems because of the large term and document space. There 
are means to reduce the term and document space based on counting support for formal 
concepts (e.g., Titanic [18]). However, this type of algorithm is rather new and we are not
aware of any work that has applied it to texts.

Latent Semantic Indexing (LSI; cf. [4,7]) constitutes a paradigm that groups words 
into ‘concepts’ based on their cooccurrences in a given dataset. LSI then allows for 
text clustering or classification taking into account these ‘concepts’. Compared against 
our approach, its biggest advantage is that a thesaurus is not needed, but the largest 
disadvantage of LSI is that the notion of ‘concept’ that LSI introduces cannot easily be 
explained to a common user. Also, an explanation by a more general concept (the ‘first 
dimension’ sketched above) is not possible. 

Finally, Karypis and Han [10] have built on the principal idea of reducing term 
space from LSI, but introduce ‘concepts’ that are based on the automatic clustering of 
words. They achieve performance quality comparable to LSI, but their method is more 
accessible to a human user and might be integrated in the future with a manually defined 
core thesaurus. We see here a promising line of further research combining their idea 
of concept construction by clustering (also found in other areas like ontology learning 
[12]) with its immediate use in text classification and clustering as well as in explaining 
clustering results as we have proposed in this paper. 



7 Conclusion 

In this paper, we presented a novel combination of known techniques for text clustering.
First, we extended the typical vector space representation of text by synsets of WordNet,
in order to exploit its semantics. Then we clustered the documents with the BiSec-k-Means
algorithm, using the cosine for measuring the similarity of documents. For
each cluster, we extracted a conceptual description, which we used for arranging the
clusters in a lattice using Formal Concept Analysis. This blend of known techniques
has been shown to combine the benefits of each of the techniques involved: WordNet
provides means to identify apparently different terms on a higher level of abstraction;
BiSec-k-Means structures the domain and reduces it to a manageable size; the extracted
cluster descriptions help to identify the content of individual clusters; and FCA and its
visualization means reveal the relations between those clusters.

Abstract. Many effective and efficient learning algorithms assume independence
of attributes. They often perform well even in domains where
this assumption is not really true. However, they may fail badly when 
the degree of attribute dependencies becomes critical. In this paper, we 
examine methods for detecting deviations from independence. These de- 
pendencies give rise to “interactions” between attributes which affect 
the performance of learning algorithms. We first formally define the de- 
gree of interaction between attributes through the deviation of the best 
possible “voting” classifier from the true relation between the class and 
the attributes in a domain. Then we propose a practical heuristic for 
detecting attribute interactions, called interaction gain. We experimen- 
tally investigate the suitability of interaction gain for handling attribute 
interactions in machine learning. We also propose visualization methods 
for graphical exploration of interactions in a domain. 



1 Introduction 

Many learning algorithms assume independence of attributes, such as the naive 
Bayesian classifier (NBC), logistic regression, and several others. The indepen- 
dence assumption licenses the classifier to collect the evidence for a class from 
individual attributes separately. An attribute’s contribution to class evidence is 
thus determined independently of other attributes. The independence assump- 
tion does not merely simplify the learning algorithm; it also results in robust 
performance and in simplicity of the learned models. 

Estimating evidence from given training data with the independence assump- 
tion is more robust than when attribute dependencies are taken into account. 
The evidence from individual attributes can be estimated from larger data sam- 
ples, whereas the handling of attribute dependencies leads to fragmentation of 
available data and consequently to unreliable estimates of evidence. This increase 
in robustness is particularly important when data is scarce, a common problem 
in many applications. In practice these unreliable estimates often cause inferior 
performance of more sophisticated methods. 

Methods like NBC that consider one attribute at a time are called “myopic.”
Such methods compute evidence about the class separately for each attribute
(independently from other attributes), and then simply “sum up” all these pieces
of evidence. This “voting” does not have to be an actual arithmetic sum (for
example, it can be the product, that is the sum of logarithms, as in NBC). The
aggregation of pieces of evidence coming from individual attributes does not
depend on the relations among the attributes. We will refer to such methods as
“voting methods;” they employ “voting classifiers.”

A well-known example where the myopia of voting methods results in complete
failure is the concept of exclusive OR: C = XOR(X, Y), where C is a
Boolean class, and X and Y are Boolean attributes. Myopically looking at attribute
X alone provides no evidence about the value of C. The reason is that
the relation between X and C critically depends on Y. For Y = 0, C = X; for
Y = 1, C = ¬X. Similarly, Y alone fails. However, X and Y together perfectly
determine C. We say that there is a positive interaction between X and Y with
respect to C. In the case of a positive interaction, the evidence from X and
Y jointly about C is greater than the sum of the evidence from X alone and the evidence
from Y alone.
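
The failure of a voting classifier on XOR can be reproduced in a few lines. The sketch below is an illustration only, assuming the four attribute combinations are equally likely and that all probabilities are estimated by counting; the function names are hypothetical. It compares the naive Bayesian posterior with the posterior obtained from the joint attribute XY.

```python
from itertools import product

# The four equally likely XOR instances: (x, y, c) with c = x XOR y.
DATA = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]


def cond_prob(attr_index, value, c):
    """P(attribute = value | C = c), estimated by counting over DATA."""
    rows = [r for r in DATA if r[2] == c]
    return sum(1 for r in rows if r[attr_index] == value) / len(rows)


def naive_bayes_posterior(x, y):
    """P(C = 1 | x, y) under the independence assumption."""
    score = {}
    for c in (0, 1):
        prior = sum(1 for r in DATA if r[2] == c) / len(DATA)
        score[c] = prior * cond_prob(0, x, c) * cond_prob(1, y, c)
    return score[1] / (score[0] + score[1])


def joint_posterior(x, y):
    """P(C = 1 | x, y) estimated from the joint attribute XY."""
    rows = [r for r in DATA if r[0] == x and r[1] == y]
    return sum(r[2] for r in rows) / len(rows)


# Posteriors for the inputs (0,0), (0,1), (1,0), (1,1), in that order.
nb = [naive_bayes_posterior(x, y) for x, y in product([0, 1], repeat=2)]
joint = [joint_posterior(x, y) for x, y in product([0, 1], repeat=2)]
```

Under these assumptions every entry of `nb` is 0.5, so the voting classifier is no better than guessing, while `joint` recovers the class exactly.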

The opposite may also happen, namely that the evidence from X and Y 
jointly is worth less than the sum of the individual pieces of evidence. In such 
cases we say that there is a negative interaction between X and Y w.r.t. C. 
A simple example is when attribute Y is (essentially) a duplicate of X. For 
example, the length of the diagonal of a square duplicates the side of the square. 
Voting classifiers are confused by negative interactions as well as by positive ones.



2 Attribute Interactions 



Let us first define the concept of interaction among attributes formally. Let there
be a supervised learning problem with class $C$ and attributes $X_1, X_2, \ldots$. Under
conditions of noise or incomplete information, the attributes need not determine
the class values perfectly. Instead, they provide some “degree of evidence” for or
against particular class values. For example, given an attribute-value vector, the
degrees of evidence for all possible class values may be a probability distribution
over the class values given the attribute values.

Let the evidence function $f(C, X_1, X_2, \ldots, X_k)$ define some chosen “true”
degree of evidence for class $C$ in the domain. The task of machine learning is to
induce an approximation to function $f$ from training data. In this sense, $f$ is the
target concept for learning. In classification, $f$ (or its approximation) would be
used as follows: if for given attribute values $x_1, x_2, \ldots, x_k$: $f(c_1, x_1, x_2, \ldots, x_k) >
f(c_2, x_1, \ldots, x_k)$, then the class $c_1$ is more likely than $c_2$.

We define the presence, or absence, of interactions among the attributes as 
follows. If the evidence function can be written as a (“voting”) sum: 



$$f(C, X_1, X_2, \ldots, X_k) = v\left(\sum_{i=1}^{k} e_i(C, X_i)\right) \tag{1}$$



for some voting function $v$ and myopic predictor functions $e_1, e_2, \ldots, e_k$, then
there is no interaction between the attributes. Equation (1) requires that the
joint evidence of all the attributes can essentially be reduced to the sum of
the pieces of evidence $e_i(C, X_i)$ from individual attributes. The function $e_i$ is
a predictor that investigates the relationship between an attribute $X_i$ and the
class $C$.

If, on the other hand, no such functions $v, e_1, e_2, \ldots, e_k$ exist for which (1)
holds, then there are interactions among the attributes. The strength of interactions
$IS$ can be defined as

$$IS := f(C, X_1, X_2, \ldots, X_k) - v\left(\sum_{i=1}^{k} e_i(C, X_i)\right). \tag{2}$$

IS greater than some positive threshold would indicate a positive interaction, 
and IS less than some negative threshold would indicate a negative interaction. 
Positive interactions indicate that a holistic view of the attributes unveils new 
evidence. Negative interactions are caused by multiple attributes providing the 
same evidence, while the evidence should count only once. 

Many classifiers are based on the linear form of (1): naive Bayesian classifier, 
logistic and linear regression, linear discriminants, support vector machines with 
linear kernels, and others. Hence, interaction analysis is relevant for all these 
methods. All we have written about relationships between attributes also carries 
over to relationships between predictors in an ensemble. 
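
As a hedged illustration of how a familiar classifier fits the voting form of equation (1), consider the naive Bayesian classifier. The particular choice of $e_i$ and $v$ below is one possible reading, assumed here for illustration rather than taken from the text; it spreads the class prior evenly over the $k$ predictors:

$$
\begin{aligned}
P(c \mid x_1, \ldots, x_k) &\propto P(c) \prod_{i=1}^{k} P(x_i \mid c), \\
e_i(c, x_i) &= \log P(x_i \mid c) + \tfrac{1}{k} \log P(c), \\
v(s)(c) &= \frac{\exp s(c)}{\sum_{c'} \exp s(c')}, \qquad s(c) = \sum_{i=1}^{k} e_i(c, x_i).
\end{aligned}
$$

Here the “sum” is a sum of logarithms, and $v$ merely exponentiates and normalizes it without looking at relations among the attributes.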

3 Interaction Gain: A Heuristic for Detecting Interactions 

The above definition of an interaction provides a “gold standard” for deciding,
in principle, whether there is interaction between two attributes. The definition 
is, however, hard to use as a procedure for detecting interactions in practice. Its 
implementation would require combinatorial optimization. 

We will not refine the above definition of interactions to make it applicable 
in a practical learning setting. Instead, we propose a heuristic test, called inter- 
action gain, for detecting positive and negative interactions in the data, in the 
spirit of the above definition. Our heuristic will be based on the information-theoretic
notion of entropy as the measure of classifier performance, the joint probability distribution
as the predictor, and the chain rule as the voting function. Entropy
has many useful properties, such as linear additivity of entropy with independent 
sources. We will consider discriminative learning, where our task is to study the 
class probability distribution. That is why we will always investigate relation- 
ships between an attribute and the class, or between attributes with respect to 
the class. 

Interaction gain is based on the well-known idea of information gain. The information
gain of a single attribute $X$ with respect to class $C$, also known as the mutual
information between $X$ and $C$, is measured in bits:

$$\mathrm{Gain}_C(X) = I(X; C) = \sum_{x} \sum_{c} P(x, c) \log \frac{P(x, c)}{P(x)\,P(c)}. \tag{3}$$

Information gain can be regarded as a measure of the strength of a 2-way interaction
between an attribute $X$ and the class $C$. In this spirit, we can generalize
it to 3-way interactions by introducing the interaction gain [1]:

$$I(X; Y; C) := I(X, Y; C) - I(X; C) - I(Y; C). \tag{4}$$

Interaction gain is also measured in bits, and can be understood as the difference 
between the actual decrease in entropy achieved by the joint attribute XY and 
the expected decrease in entropy with the assumption of independence between 
attributes X and Y . The higher the interaction gain, the more information was 
gained by joining the attributes in the Cartesian product, in comparison with the 
information gained from single attributes. When the interaction gain is negative, 
both X and Y carry the same evidence, which was consequently subtracted twice. 
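
Interaction gain is straightforward to estimate from a sample. The following minimal sketch assumes empirical (counting) estimates of all probabilities and uses the standard identity I(A;B) = H(A) + H(B) - H(A,B); the function names are illustrative and do not come from the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())


def mutual_information(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B), estimated from paired samples."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))


def interaction_gain(x, y, c):
    """I(X;Y;C) = I(X,Y;C) - I(X;C) - I(Y;C), as in equation (4)."""
    xy = list(zip(x, y))  # the joint attribute XY (Cartesian product)
    return (mutual_information(xy, c)
            - mutual_information(x, c)
            - mutual_information(y, c))


# XOR: neither attribute alone is informative, both together determine C.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
xor_gain = interaction_gain(x, y, [a ^ b for a, b in zip(x, y)])

# Duplicate attribute: Y repeats X, so the same evidence is counted twice.
dup_gain = interaction_gain([0, 1], [0, 1], [0, 1])
```

On these two toy samples the estimate is +1 bit for XOR (a positive interaction) and -1 bit for the duplicated attribute (a negative interaction). With small real samples the counting estimates are noisy, so the sign of a gain close to zero should not be over-interpreted.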

To simplify our understanding, we can use the entropy $H(X)$ to measure
the uncertainty of an information source $X$ through the identity $I(X; Y) =
H(X) + H(Y) - H(X, Y)$. It is then not difficult to show that $I(X; Y; C) =
I(X; Y|C) - I(X; Y)$. Here, $I(X; Y|C) = H(X|C) + H(Y|C) - H(X, Y|C)$ is
conditional mutual information, a measure of dependence of two attributes given
the context of $C$. $I(X; Y)$ is an information-theoretic measure of dependence or
“correlation” between the attributes $X$ and $Y$ regardless of the context.
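
The identity can be verified by expanding every term into entropies. The derivation below is a worked sketch that uses only the definition of interaction gain, the entropy form of mutual information, and the chain rule $H(A|C) = H(A, C) - H(C)$:

$$
\begin{aligned}
I(X; Y; C) &= I(X, Y; C) - I(X; C) - I(Y; C) \\
&= \bigl[H(X, Y) + H(C) - H(X, Y, C)\bigr] - \bigl[H(X) + H(C) - H(X, C)\bigr] - \bigl[H(Y) + H(C) - H(Y, C)\bigr] \\
&= \bigl[H(X, C) + H(Y, C) - H(C) - H(X, Y, C)\bigr] - \bigl[H(X) + H(Y) - H(X, Y)\bigr] \\
&= \bigl[H(X|C) + H(Y|C) - H(X, Y|C)\bigr] - I(X; Y) \\
&= I(X; Y|C) - I(X; Y).
\end{aligned}
$$

The third line only regroups the seven entropy terms; the fourth applies the chain rule to each term of the first bracket.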

Interaction gain (4) describes the change in a dependence of a pair of at- 
tributes X, Y by introducing context C. It is quite easy to see that when inter- 
action gain is negative, context decreased the amount of dependence. When the 
interaction gain is positive, context increased the amount of dependence. When 
the interaction gain is zero, context did not affect the dependence between the 
two attributes. Interaction gain is identical to the notion of interaction informa- 
tion [2] and mutual information among three random variables [3,4]. 

4 Detecting and Resolving Interactions 

A number of methods have been proposed to account for dependencies in ma- 
chine learning, in particular with respect to the naive Bayesian classification 
model [5,6,7], showing improvement in comparison with the basic model. The 
first two of these methods, in a sense, perform feature construction; new features 
are constructed from interacting attributes, by relying on detection of interac- 
tions. On the other hand, tree augmentation [7], merely makes the dependence 
explicit, but this is more a syntactic distinction. In this section we experimen- 
tally investigate the relevance of interaction gain as a heuristic for guiding feature 
construction. 

The main questions addressed in this section are: Is interaction gain a good 
heuristic for detecting interactions? Does it correspond well to the principled 
definition of interactions in Section 2? 

The experimental scenario is as follows: 

1. We formulate an operational approximation to our definition of interaction.
This is a reasonable and easy to implement special case of formula (1) as
follows: the degree of evidence is a probability, and the formula (1) is instantiated
to the naive Bayesian formula. It provides an efficient test for
interactions. We refer to this test as BS.

2. In each experimental data set, we select the most interacting pair of attributes
according to (a) BS, (b) positive interaction gain (PIG), and (c)
negative interaction gain (NIG).

3. We build naive Bayesian classifiers (NBC) in which the selected interactions
are “resolved.” That is, the selected pair of most interacting attributes is
replaced in NBC by its Cartesian product. This interaction resolution is done
for the result of each of the three interaction detection heuristics (BS, PIG
and NIG), and the performance of the three resulting classifiers is compared.

We chose to measure the performance of a classifier with the Brier score (described
below). We avoided classification accuracy as a performance measure for
the following reasons. Classification accuracy is not very sensitive in the context
of probabilistic classification: it usually does not matter for classification accuracy
whether a classifier predicted the true class with the probability of 1 or
with the probability of, e.g., 0.51. To account for the precision of probabilistic
predictions, we employed the Brier score. Given two probability distributions, the
predicted class probability distribution $\hat{p}$ and the actual class probability distribution
$p$, where the class can take $N$ values, the Brier score [8] of the prediction
is:

$$b(\hat{p}, p) := \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - p_i)^2 \tag{5}$$

The larger the Brier score, the worse a prediction. Error rate is a special case of 
Brier score for deterministic classifiers, while Brier score could additionally re- 
ward a probabilistic classifier for better estimating the probability. In a practical 
evaluation of a classifier given a particular testing instance, we approximate the 
actual class distribution by assigning a probability of 1 to the true class of the 
testing instance. For multiple testing instances, we compute the average Brier 
score. 
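
The evaluation just described can be sketched as follows. This is an illustrative implementation, assuming that the score is normalized by the number of classes N as in equation (5) and that the actual distribution puts probability 1 on the true class; the helper names `one_hot` and `average_brier` are hypothetical.

```python
def brier_score(predicted, actual):
    """Mean squared difference between two class probability distributions."""
    if len(predicted) != len(actual):
        raise ValueError("distributions must range over the same classes")
    n = len(predicted)
    return sum((p_hat - p) ** 2 for p_hat, p in zip(predicted, actual)) / n


def one_hot(true_class, n_classes):
    """Approximate the actual distribution: probability 1 for the true class."""
    return [1.0 if i == true_class else 0.0 for i in range(n_classes)]


def average_brier(predictions, true_classes):
    """Average Brier score over several testing instances."""
    n_classes = len(predictions[0])
    scores = [brier_score(p, one_hot(c, n_classes))
              for p, c in zip(predictions, true_classes)]
    return sum(scores) / len(scores)


# Both predictions pick class 0, so accuracy cannot tell them apart,
# but the Brier score penalizes the hesitant one.
confident = brier_score([1.0, 0.0], one_hot(0, 2))
hesitant = brier_score([0.51, 0.49], one_hot(0, 2))
```

The two example predictions have the same classification accuracy, yet `confident` scores 0 while `hesitant` scores about 0.24, which is the sensitivity argued for above.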

We used two information-theoretic heuristics, based on interaction gain:
the interaction with the maximal positive magnitude (PIG), and the interaction
with the minimal negative magnitude (NIG). We also used a wrapper-like
heuristic: the interaction with the maximum improvement in the naive Bayesian
classifier performance after merging the attribute pair, as measured with the
Brier score or classification accuracy on the training set, $b(\mathrm{NBC}(C|X)(C|Y)) -
b(\mathrm{NBC}(C|X, Y))$, the first term corresponding to the independence-assuming
naive Bayesian classifier and the second to the Bayesian classifier assuming dependence.
This heuristic (BS) is closely related to the notion of conditional mutual
information, which can be understood as the Kullback-Leibler divergence
between the two possible models.

As the basic learning algorithm, we have used the naive Bayesian classifier.
After the most important interaction was determined outside the context of other
attributes, we modified the NBC model created with all the domain’s attributes
by taking the single most interacting pair of attributes and replacing them with
their Cartesian product, thus eliminating that particular dependence. All the
numerical attributes in the domains were discretized beforehand, and missing
values represented as special values. Evaluations of the default NBC model and
of its modifications with different guiding heuristics were performed with 10-fold
cross-validation. For each fold, we computed the average score. For each
domain, we computed the score mean and the standard error over the 10 fold
experiments. We performed all our experiments with the Orange toolkit [9].
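
Resolving an interaction amounts to a simple data transformation. The sketch below is not the Orange toolkit code used in the experiments; it is a plain-Python illustration with a hypothetical function name, assuming each instance is a tuple of discrete attribute values with the class stored separately.

```python
def resolve_interaction(rows, i, j):
    """Replace attributes i and j by a single attribute whose values are the
    pairs (row[i], row[j]), i.e. the Cartesian product of the two attributes.
    The merged attribute is appended after the remaining attributes."""
    merged = []
    for row in rows:
        rest = [v for k, v in enumerate(row) if k not in (i, j)]
        merged.append(tuple(rest) + ((row[i], row[j]),))
    return merged


# Three attributes per instance; attributes 0 and 1 are merged.
rows = [(0, 0, "a"), (0, 1, "b"), (1, 1, "a")]
resolved = resolve_interaction(rows, 0, 1)
```

A naive Bayesian classifier trained on `resolved` estimates one conditional distribution for the merged attribute instead of two separate ones, so the dependence between the two original attributes no longer violates its assumption.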



Table 1. The table lists Brier scores obtained with 10-fold cross validation after re- 
solving the most important interaction, as assessed with different methods. A result is 
set in bold face if it is the best for the domain, and checked if it is within the standard 
error of the best result for the domain. We marked the artificial domains. 



domain     NB       PIG          NIG      BS       | domain     NB           PIG          NIG          BS
lung       0.230√   0.208        0.247    0.243    | soy-large  [illegible]  0.007        [illegible]  [illegible]
soy-small  0.016    0.016        0.016    0.016    | wisc-canc  [illegible]  0.023        [illegible]  [illegible]
zoo        0.018    0.019√       0.018    0.018    | austral    0.120√       0.127        0.114        0.116√
lymph      0.079√   0.094        0.077√   0.075    | credit     0.116√       0.122        0.111        0.115√
wine       0.010    0.010√       0.015    0.014    | pima       0.159√       0.159√       0.158        0.159√
glass      0.070    0.071√       0.071√   0.073√   | vehicle    0.142        0.136        0.138        0.127
breast     0.212    0.242        0.212√   0.221√   | heart      0.095        0.098        0.095√       [illegible]
ecoli      0.032    [illegible]  0.039    0.046    | german     0.173        0.175√       0.174√       0.175√
horse-col  0.108√   0.127        0.106√   0.104    | cmc        0.199√       0.194        0.195√       0.198√
voting     0.089    0.098        0.089    0.063    | segment    0.016        0.017        0.017        0.015
monk3†     0.042    0.027        0.042    0.027    | krkp†      0.092        [illegible]  0.088        0.076
monk1†     0.175    0.012        0.176    0.012    | mushroom   0.002        0.006        0.002        0.002
monk2†     0.226√   0.223        0.224√   0.226√   | adult      0.119        0.120        0.115        0.119


In Table 1 we sorted 26 of the UCI KDD archive [10] domains according to 
the number of instances in the domain, from the smallest on the top left to the 
largest on the bottom right, along with the results obtained in the above manner. 
We can observe that in two domains resolution methods matched the original 
result. In 6 domains resolution methods worsened the results, in 10 domains the 
original performance was within a standard error of the best, and in 8 domains, 
the improvement was significant beyond a standard error. We can thus confirm 
that accounting for interactions in this primitive way did help in ~70% of the 
domains. 

Comparing the different resolution algorithms, the Brier-score driven interaction
detection was superior to either of the information-based heuristics PIG or
NIG, achieving the best result in 11 domains. However, in only two domains,
‘voting’ and ‘segment,’ neither PIG nor NIG was able to improve the result
while BS did. Thus, information-theoretic heuristics are a reasonable and effective
choice for interaction detection, providing competitive results even if the BS
heuristic had the advantage of using the same evaluation function as the final
classifier evaluation. PIG improved the results in 5 natural domains, and in 4
artificial domains. This confirms earlier intuitions that the XOR-type phenomena
occur more often in synthetic domains. NIG provided an improvement in 7
domains, all of them natural. Negative interactions are generally very frequent,
but probabilistic overfitting cannot always be resolved to a satisfactory extent
by merely resolving the strongest interaction because the result is dependent on
the balance between multiple negatively interacting attributes.



Table 2. A summary of results shows that BS-driven interaction resolution provides 
the most robust approach, while AIG follows closely behind. 



times   NB   PIG   NIG   BS   AIG
best     8     8     7   11    10
good    10     7    10   11     7
bad      8    11     9    4     9



In Table 2, we summarize the performance of different methods, including 
AIG as a simple approach deciding between the application of NIG and PIG in 
a given domain. AIG suggests resolving the interaction with the largest absolute 
interaction gain. We can also observe that success is more likely when the do- 
main contains a large number of instances. It is a known result from statistical 
literature that a lot of evidence is needed to show the significance of higher-order 
interactions [11]. 
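
Given the interaction gains of all attribute pairs, the three information-theoretic selections differ only in the ordering criterion. The following sketch assumes the gains have already been computed and are stored in a dictionary; the function name and the example values are hypothetical.

```python
def select_pairs(gains):
    """gains: dict mapping an attribute pair to its interaction gain in bits.

    Returns the pair chosen by each heuristic. Note that if no gain is
    positive (or none is negative), PIG (or NIG) still returns the extreme
    pair, so callers may want to check the sign before resolving it."""
    pig = max(gains, key=gains.get)                 # most positive gain
    nig = min(gains, key=gains.get)                 # most negative gain
    aig = max(gains, key=lambda p: abs(gains[p]))   # largest absolute gain
    return {"PIG": pig, "NIG": nig, "AIG": aig}


# Hypothetical gains for three attribute pairs.
example = select_pairs({
    ("A", "B"): -0.30,
    ("A", "D"): 0.12,
    ("B", "C"): -0.05,
})
```

In this made-up example AIG sides with NIG because the strongest interaction is negative; in a domain dominated by an XOR-like dependence it would side with PIG instead.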

5 Visualization of Interactions 

The analysis of attribute relationships can be facilitated by methods of information
visualization. We propose two methods for visualization of attribute interactions.
An interaction dendrogram illustrates groups of mutually interacting attributes.
An interaction graph provides detailed insight into the nature of attribute
relationships in a given domain.



5.1 Interaction Dendrograms 

An interaction dendrogram illustrates the change in dependence between pairs of
attributes after introducing the context. The direction of change is not important:
we will distinguish this later. If we bind proximity in our presentation to
the change in level of dependence, either positive or negative, we can define the
distance $d_m$ between two attributes $X$, $Y$ as:

$$d_m(X, Y) = \begin{cases} |I(X; Y; C)|^{-1} & \text{if } |I(X; Y; C)|^{-1} < 1000, \\ 1000 & \text{otherwise.} \end{cases} \tag{6}$$



Here, 1000 is a chosen upper bound to prevent attribute independence from
disproportionately affecting the graphical representation. To present the function
$d_m$ to a human analyst, we tabulate it in a dissimilarity matrix and apply

Fig. 1. An interaction dendrogram illustrates which attributes interact, positively or
negatively, in the ‘census/adult’ (left) and ‘cmc’ (right) data sets. We used Ward’s
method for agglomerative hierarchical clustering [12].



the techniques of hierarchical clustering or multi-dimensional scaling. Depen- 
dent attributes will hence appear close to one another; independent attributes 
will appear far from one another. This visualization is an approach to variable 
clustering, which is normally applied to numerical variables outside the context 
of supervised learning. Diagrams, such as those in Fig. 1, may be directly useful 
for feature selection: the search for the best model starts by only picking the 
individually best attribute from each cluster. We must note, however, that an 
attribute’s membership in a cluster merely indicates its average relationship with 
other cluster members. 
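
The dissimilarity matrix that feeds the clustering step can be built as in the sketch below. It is a minimal illustration, assuming the interaction gains are available in a dictionary keyed by attribute pairs; the function names and example gains are hypothetical, and pairs without a recorded gain are treated as independent. The hierarchical clustering itself is left to whatever clustering routine is at hand.

```python
def interaction_distance(gain, cap=1000.0):
    """Reciprocal of the absolute interaction gain, capped at `cap`
    so that (near-)independent pairs do not dominate the picture."""
    magnitude = abs(gain)
    if magnitude == 0.0:
        return cap
    return min(1.0 / magnitude, cap)


def dissimilarity_matrix(attributes, gains):
    """Symmetric matrix of distances between all attribute pairs."""
    n = len(attributes)
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            pair = (attributes[a], attributes[b])
            gain = gains.get(pair, gains.get(pair[::-1], 0.0))
            matrix[a][b] = matrix[b][a] = interaction_distance(gain)
    return matrix


# Hypothetical gains (in bits) for three attributes.
matrix = dissimilarity_matrix(
    ["A", "B", "D"],
    {("A", "B"): -0.5, ("A", "D"): 0.25},
)
```

Because the distance uses the absolute value of the gain, strongly positive and strongly negative interactions both end up close together, which is why the dendrogram cannot show the sign of an interaction.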



5.2 Interaction Graphs 

The analysis described in the previous section was limited to rendering the mag- 
nitude of interaction gains between attributes. We cannot use the dendrogram 
to identify whether an interaction is positive or negative, nor can we see the im- 
portance of each attribute. An interaction graph presents the proximity matrix 
better. To reduce clutter, only the strongest N interactions are shown, usually 
5 < N < 20. With an interactive method for graph exploration, this trick would 
not be necessary. We also noticed that the distribution of interaction gains usu- 
ally follows a Gaussian-like distribution, with only a few interactions standing 
out from the crowd, either on the positive or on the negative side. 

Each node in the interaction graph corresponds to an attribute. The information
gain of each attribute is expressed as a percentage of the class entropy
(although some other uncertainty measure, such as the error rate or Brier score,
could be used in the place of class entropy), and written below the attribute
name. There are two kinds of edges, bidirectional arrows and undirected dashed
arcs. Arcs indicate negative interactions, implying that the two attributes provide
partly the same information. The amount of shared information, as a percentage
of the class entropy, labels the arc. Analogously, the amount of novel

Fig. 2. The four most informative attributes were selected from a real medical domain.
In the interaction graph (left), the most important attribute A alone eliminates 78%
of class entropy. The second most important attribute B alone eliminates 76% of class
entropy, but A and B interact negatively (dashed arc), and share 75% of class entropy.
So B reduces class entropy by only 76-75=1% of its truly own once we have accounted
for A; but if we leave B out in feature subset selection, we are giving this information up.
Similarly, C provides 4% of its own information, while the remaining 13% is contained
in both A and B. Attribute D provides ‘only’ 16% of information, but if we account
for the positive interaction between A and D (solid bidirectional arrow), we provide for
78+16+6=100% of class entropy. Consequently, only attributes A and D are needed,
and they should be treated as dependent. A Bayesian network [14] learned from the
domain data (right) is arguably less informative.



information labels the arrow, indicating a positive interaction between a pair of 
attributes. Figure 2 explains the interpretation of the interaction graph, while 
Figs. 3 and 4 illustrate two domains. We used the ‘dot’ utility [13] for generating 
the graph. 

6 Implications for Classification 

In discussing implications of interaction analysis for classification, there are two 
relevant questions. The first is the question of significance: when is a particular
interaction worth considering? The second is the question of how to treat negative
and positive interactions between attributes in the data. 

In theory, we should react whenever the conditional mutual information 
I(X; Y|C) deviates sufficiently from zero: it is a test of conditional dependence.
In practice, using a joint probability distribution for XY would increase the 
complexity of the classifier, and this is often not justified when the training 
data is scarce. Namely, introducing the joint conditional probability distribution 
P(X, Y|C) in place of two marginal probability distributions P(X|C)P(Y|C) increases
the degrees of freedom of the model, thus increasing the likelihood that
the fit was accidental. In the spirit of Occam’s razor, we should increase the com- 
plexity of a classifier only to obtain a significant improvement in classification 
performance. Hence, the true test in practical applications is improvement in 
generalization performance, measured with devices such as the training/testing 
set separation and cross-validation. When the improvement after accounting for 
an interaction is significant, the interaction itself is significant. 
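
The conditional mutual information mentioned above can be estimated directly from data. The following is a minimal sketch, assuming discrete attributes and plain relative-frequency estimates without smoothing; the function name and the data layout (a list of (x, y, c) tuples) are illustrative choices, not part of the original method.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(samples):
    """Estimate I(X; Y | C) in bits from a list of (x, y, c) tuples."""
    n = len(samples)
    n_xyc = Counter(samples)
    n_xc = Counter((x, c) for x, _, c in samples)
    n_yc = Counter((y, c) for _, y, c in samples)
    n_c = Counter(c for _, _, c in samples)
    total = 0.0
    for (x, y, c), k in n_xyc.items():
        # P(x,y,c) * log2( P(x,y|c) / (P(x|c) * P(y|c)) ), all as relative frequencies
        total += (k / n) * log2(k * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)]))
    return total


# X and Y independent within each class: the estimate is zero.
independent = [(x, y, c) for x in (0, 1) for y in (0, 1) for c in (0, 1)]
# Y is a copy of X within each class: one full bit of conditional dependence.
dependent = [(x, x, c) for x in (0, 1) for c in (0, 1)]
```

A value near zero suggests that P(X|C)P(Y|C) is an adequate model. As the text stresses, a non-zero estimate on scarce data still needs to be confirmed by an improvement in generalization performance.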
Fig. 3. An interaction graph for the ‘census/adult’ domain confirms our intuitions
about natural relationships between the attributes. All interactions in this graph are 
negative, but there are two clusters of them. 




Fig. 4. In this illustration of the ‘horse colic’ domain, one attribute appears to moderate 
a number of other attributes’ relationships with the class. There is a separate and 
independent negative interaction on the right. 



6.1 Negative Interactions 

If X and Y are interacting negatively, they provide partly the same information.
If we disregard a negative interaction, we are modifying the class probability 
distribution with the same information twice. If the duplicated evidence is bi- 
ased towards one of the classes, this may shift the prediction. The estimated 
class probabilities may become excessively confident for one of the classes, an- 
other case of overfitting. Unbalanced class probability estimates by themselves 
do not necessarily bother several classification performance measures, such as 
the classification accuracy and the ROC, because they do not always change the 
classifications. 

Even if the naive Bayesian classifier sometimes works optimally in spite of 
negative interactions, it is often useful to resolve them. The most frequent method
is feature selection. However, we observe that two noisy measurements of the 
same quantity are better than a single measurement, so other approaches may 
be preferable. One approach is assigning weights to attributes, such as feature 
weighting or least-squares regression. Alternatively, a latent attribute L can be 
inferred, to provide evidence for all three attributes: X, Y and C. The trivial ap- 
proach to latent attribute inference is the introduction of the Cartesian product 
between attributes, the technique we applied in our experiments, but methods 
like factor analysis and independent component analysis are also applicable here.
This kind of attribute dependence is a simple and obvious explanation for con- 
ditional dependence. Negative interactions allow us to simplify the model in the 
sense of reducing the quantity of evidence being dealt with. 
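
The effect of counting the same evidence twice can be illustrated with a small numeric sketch. It assumes a two-class problem, a uniform prior, and an attribute Y that is an exact copy of X; the probabilities are made-up illustration values, not results from the experiments.

```python
def naive_bayes_posterior(prior, likelihoods):
    """Class posterior under the naive Bayes independence assumption.

    prior:       dict mapping class -> P(c)
    likelihoods: list of dicts mapping class -> P(observed value | c),
                 one dict per attribute
    """
    scores = dict(prior)
    for lik in likelihoods:
        for c in scores:
            scores[c] *= lik[c]
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}


prior = {"pos": 0.5, "neg": 0.5}
x_evidence = {"pos": 0.8, "neg": 0.2}  # hypothetical P(X = x | c)

once = naive_bayes_posterior(prior, [x_evidence])
# Y duplicates X, so the same evidence enters the product a second time.
twice = naive_bayes_posterior(prior, [x_evidence, x_evidence])
```

Here the posterior for "pos" rises from 0.8 to 16/17 (about 0.94) although no new information was added. The predicted class is unchanged, which is why accuracy may be unaffected while the probability estimate becomes overconfident.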

6.2 Positive Interactions 

The second cause of independence assumption violation is when two attributes 
together explain more than what we estimated from each attribute individually. 
There could be some unexplained moderating effect of the first attribute onto 
the second attribute’s evidence for C. There could be a functional dependence 
involving X, Y and C, possibly resolved by feature construction. Such positive 
interactions can be inferred from a positive value of interaction gain. Positive in- 
teractions are interesting subjects for additional study of the domain, indicating 
complex regularities. They too can be handled with latent attribute inference. 
Positive interactions indicate a possible benefit of complicating the model. 
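
A positive interaction is easiest to see on the exclusive-or problem. The sketch below assumes that interaction gain has the form I(AB;C) - I(A;C) - I(B;C) and uses relative-frequency estimates; the helper names are illustrative.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Estimate I(U; V) in bits from a list of (u, v) pairs."""
    n = len(pairs)
    n_uv = Counter(pairs)
    n_u = Counter(u for u, _ in pairs)
    n_v = Counter(v for _, v in pairs)
    return sum((k / n) * log2(k * n / (n_u[u] * n_v[v]))
               for (u, v), k in n_uv.items())


def interaction_gain(samples):
    """I(AB;C) - I(A;C) - I(B;C) for a list of (a, b, c) tuples."""
    joint = mutual_information([((a, b), c) for a, b, c in samples])
    only_a = mutual_information([(a, c) for a, _, c in samples])
    only_b = mutual_information([(b, c) for _, b, c in samples])
    return joint - only_a - only_b


# C = A xor B: neither attribute is informative alone, together they determine C.
xor_domain = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
# B is a copy of A and C equals A: the two attributes are fully redundant.
copy_domain = [(a, a, a) for a in (0, 1)]
```

On `xor_domain` the gain is +1 bit (a positive interaction), on `copy_domain` it is -1 bit (a negative interaction).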

If we disregard a positive interaction by assuming attribute independence, 
we are not taking advantage of all the information available: we are underfitting. 
Of course, one should note that the probabilities, on the basis of which entropy 
is calculated, might not be realistic. Since the probabilities of a joint probability 
distribution of two values of two attributes and the class are computed with fewer 
supporting examples than those computed with only one attribute value and the 
class, the 3-way interaction gains are less trustworthy than 2-way interaction 
gains. Consequently, positive interaction gains may in small domains indicate 
only a coincidental regularity. Taking accidental dependencies into consideration 
is a well-known cause of overfitting, but there are several ways of remedying this 
probability estimation problem, e.g. [15]. 

7 Conclusion 

In this paper we studied the detection and resolution of dependencies between 
attributes in machine learning. First we formally defined the degree of interaction 
between attributes through the deviation of the best possible “voting” classifier 
from the true relation between the class and the attributes in a domain. Then 
we proposed the interaction gain as a practical heuristic for detecting attribute 
interactions. We experimentally investigated the suitability of interaction gain 
(IG) for handling attribute dependencies in machine learning. Experimental re- 
sults can be summarized as follows: 

— IG as a heuristic for detecting interactions performs similarly to the BS
criterion (a heuristic that was directly derived from the principled formal
definition of attribute interaction), and enables the resolution of interactions
in classification learning with performance similar to that of BS.

— IG enables the distinction between positive and negative interactions while 
BS does not distinguish between these two types of interactions. Here, IG 
can explain the reason for a certain recommendation of BS. 
— According to empirical results in real-world domains, strong positive
interactions are rare, but negative interactions are ubiquitous.

— In typical artificial domains, strong interactions are more frequent, particu- 
larly positive interactions. The IG heuristic reliably detects them. 

We also presented visualization methods for graphical exploration of interactions 
in a domain. These are useful tools that should help expert’s understanding of 
the domain under study, and could possibly be used in constructing a predic- 
tive model. Problems for future work include: handling n-way interactions where 
n > 3; building learning algorithms that will incorporate interaction detection 
facilities, and provide superior means of resolving these interactions when build- 
ing classifiers. 
Abstract. Application domains such as the life sciences, e.g. molecular biology,
produce a tremendous amount of data which can no longer be managed
without the help of efficient and effective data mining methods. One
of the primary data mining tasks is clustering. However, traditional clus- 
tering algorithms often fail to detect meaningful clusters because of the 
high dimensional, inherently sparse feature space of most real-world data 
sets. Nevertheless, the data sets often contain clusters hidden in various 
subspaces of the original feature space. We present a pre-processing step 
for traditional clustering algorithms, which detects all interesting sub- 
spaces of high-dimensional data containing clusters. For this purpose, we 
define a quality criterion for the interestingness of a subspace and propose
an efficient algorithm called RIS (Ranking Interesting Subspaces)
to examine all such subspaces. A broad evaluation based on synthetic 
and real-world data sets empirically shows that RIS is suitable to find 
all relevant subspaces in large, high dimensional, sparse data and to rank 
them accordingly. 



1 Introduction 

The tremendous amount of data produced nowadays in various application do- 
mains such as molecular biology can only be fully exploited by efficient and 
effective data mining tools. One of the primary data mining tasks is clustering 
which is the task of partitioning objects of a data set into distinct groups (clus- 
ters) such that two objects from one cluster are similar to each other, whereas 
two objects from distinct clusters are not. 

Considerable work has been done in the area of clustering. Nevertheless, clus- 
tering real-world data sets often raises problems, since the data space is usually 
a high dimensional feature space. A prominent example is the application of 
cluster analysis to gene expression data. Depending on the goal of the applica- 
tion, the dimensionality of the feature space can be up to 10^ when clustering 
the genes and can be in the range of 10^ to more than 10"^ when clustering the
samples. In general, most of the common clustering algorithms fail to generate 
meaningful results because of the inherent sparsity of the data space. In such 
high dimensional feature spaces data does not cluster anymore. But usually, 
there are clusters in lower dimensional subspaces. In addition, objects can often 
be clustered differently in varying subspaces, i.e. objects may be grouped with 
different objects when subspaces vary. Again, gene expression data is a promi- 
nent example. When clustering the genes to detect co-regulated genes, one has 
to cope with the problem, that usually the co-regulation of the genes can only be 
detected in subsets of the samples (attributes). In other words, different subsets 
of the samples are responsible for different co-regulations of the genes. When 
clustering the samples this situation is even worse. As different phenotypes are 
hidden in varying subsets of the genes, the samples could usually be clustered 
differently according to various phenotypes, i.e. in varying subspaces. 

1.1 Related Work 

A common approach to cope with the curse of dimensionality for data mining
tasks is to apply dimensionality reduction methods. In general, these methods map
the whole feature space onto a lower-dimensional subspace of relevant attributes, 
using e.g. principal component analysis (PCA) and singular value decomposition 
(SVD). However, the transformed attributes often have no intuitive meaning 
any more and thus the resulting clusters are hard to interpret. In some cases, 
dimensionality reduction even does not yield the desired results (e.g. [1] presents 
an example where PCA does not reduce the dimensionality). In addition, using 
dimensionality reduction techniques, the data is clustered only in a particular 
subspace. The information of objects clustered differently in varying subspaces 
is lost. This is also the case for most common feature selection methods. 

A second approach for coping with clustering high-dimensional data is projected
clustering, which aims at computing k pairs $(C_i, S_i)_{0 < i \le k}$ where $C_i$ is
a set of objects representing the i-th cluster, $S_i$ is a set of attributes spanning
the subspace in which $C_i$ exists (i.e. optimizes a given clustering criterion), and
k is a user defined integer. Representative algorithms include the k-means related
PROCLUS [2], ORCLUS [3] and the density-based approach OptiGrid
[4]. While the projected clustering approach is more flexible than dimensionality 
reduction, it also suffers from the fact that the information of objects which 
are clustered differently in varying subspaces is lost. Figure 1(a) illustrates this 
problem using a feature space of four attributes A,B,C, and D. In the subspace 
AB the objects 1 and 2 cluster together with objects 3 and 4, whereas in the 
subspace CD they cluster with objects 5 and 6. Either the information of the 
cluster in subspace AB or in subspace CD will be lost. 

The most informative approach for clustering high-dimensional data is sub- 
space clustering which is the task of automatically identifying (in general several) 
subspaces of a high dimensional data space that allow better clustering of the 
data objects than the original space [1]. One of the first approaches to subspace 
clustering is CLIQUE [1], a grid-based algorithm using an Apriori-like method




Ranking Interesting Subspaces for Clustering High Dimensional Data 243 




Fig. 1. Drawbacks of existing approaches (see text for explanation). 



to recursively navigate through the set of possible subspaces in a bottom-up 
way. The data space is first partitioned by an axis-parallel grid into equi-sized
blocks of width ξ called units. Only units whose densities exceed a threshold
τ are retained. Both ξ and τ are the input parameters of CLIQUE. A cluster
is defined as a maximal set of connected dense units. Successive modifications 
of CLIQUE include ENCLUS [5] and MAFIA [6]. But the information gain of 
these approaches is also sub-optimal. As they only provide clusters and not com- 
plete partitionings of some subspaces, we do not get the information in which 
subspaces the whole dataset clusters best. Another drawback of these methods 
is caused by the use of grids. In general, grid-based approaches heavily depend 
on the positioning of the grids. Clusters may be missed if they are inadequately 
oriented or shaped. Figure 1(b) illustrates this problem for CLIQUE: Each grid
by itself is not dense if τ > 4, and thus the cluster C is not found. On the other
hand, if τ = 4, the cell with four objects in the lower right corner just above the
x-axis is reported as a cluster.

Another recent approach called DOC [7] proposes a mathematical formu- 
lation for the notion of an optimal projected cluster, regarding the density of 
points in subspaces. DOC is not grid-based but as the density of subspaces is 
measured using hypercubes of fixed width w, it has similar problems, sketched
in Figure 1(c). If a cluster is bigger than the hypercube, some objects may be
missed. Furthermore, the distribution inside the hypercube is not considered, 
and thus it need not necessarily contain only objects of one cluster. 

1.2 Contributions 

In this paper, we propose a new approach which eliminates the problems men- 
tioned above and enables the user to gain all the clustering information contained 
in high-dimensional data. We present a preprocessing step, which selects all in- 
teresting subspaces using a density-connected clustering notion. Thus we are 
able to detect all subspaces containing clusters of arbitrary size and shape. We 
first define the “interestingness” of subspaces in Section 2 and provide a quality
criterion to rank the subspaces according to their interestingness. Afterwards
any traditional clustering algorithm (e.g. the one the user is accustomed to) can
be applied to these subspaces. In Section 3, we present an efficient density-based
algorithm called RIS (Ranking Interesting Subspaces) for computing all those
subspaces. A broad experimental evaluation of RIS based on artificial as well as 
on gene expression data is presented in Section 4. Section 5 draws conclusions. 

2 Ranking Interesting Subspaces

2.1 Preliminary Definitions 

Let DB be a data set of n objects with dimensionality d. We assume that DB
is a database of feature vectors ($DB \subseteq \mathbb{R}^d$). All feature vectors have normalized
values, i.e. all values fall into [0, attrRange] for a fixed $attrRange \in \mathbb{R}^+$. Let
$A = \{a_1, \ldots, a_d\}$ be the set of all attributes $a_i$ of DB. Any subset $S \subseteq A$
is called a subspace. The projection of an object o into a subspace $S \subseteq A$ is
denoted by $\pi_S(o)$. The distance function is denoted by dist. We assume that
dist is one of the $L_p$-norms. The ε-neighborhood of an object o is defined by
$\mathcal{N}_\varepsilon(o) = \{x \in DB \mid dist(o, x) \le \varepsilon\}$. The ε-neighborhood of an object in a
subspace $S \subseteq A$ is denoted by $\mathcal{N}_\varepsilon^S(o) := \{x \in DB \mid dist(\pi_S(o), \pi_S(x)) \le \varepsilon\}$.
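
These definitions translate almost literally into code. The following is a minimal sketch, assuming that objects are tuples of floats and that a subspace is given as a tuple of attribute indices; the function names are illustrative.

```python
def dist_lp(u, v, p=2.0):
    """L_p distance between two equally long vectors; p=float('inf') is the maximum norm."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == float("inf"):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


def project(obj, subspace):
    """Projection pi_S(o): keep only the attributes listed in `subspace`."""
    return tuple(obj[i] for i in subspace)


def neighborhood(db, o, subspace, eps, p=2.0):
    """Indices of all x in db with dist(pi_S(db[o]), pi_S(x)) <= eps."""
    po = project(db[o], subspace)
    return [j for j, x in enumerate(db)
            if dist_lp(po, project(x, subspace), p) <= eps]


db = [(0.0, 0.0), (0.1, 5.0), (3.0, 0.1)]
```

With this toy data set, object 0 has a different neighborhood in each one-dimensional subspace and is alone in the full space, which is the situation that motivates looking at subspaces at all.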

2.2 Interestingness of a Subspace 

Our approach to rate the interestingness of subspaces is based on a density- 
based notion of clusters. This notion is a common approach for clustering used by 
various clustering algorithms such as DBSCAN [8], DENCLUE [9], and OPTICS 
[10]. All these methods search for regions of high density in a feature space that 
are separated by regions of lower density. We adopt the notion of [8] to define 
“dense regions” by means of core-objects: 

Definition 1. (Core-Object) 

Let $\varepsilon \in \mathbb{R}$ and $MinPts \in \mathbb{N}$. An object o is called core object if $|\mathcal{N}_\varepsilon(o)| \ge$
MinPts.

The core-object property is the key concept of the formal density-connected 
clustering notion in [8]. This property can also be used for deciding about the 
interestingness of a subspace. Obviously, if a subspace contains no core-object, it 
contains no dense region (cluster) and therefore contains no relevant information. 

Observation 1. The number of core-objects of a dataset DB (wrt. ε and
MinPts) is proportional to the number of different clusters in DB and/or the 
size of the clusters in DB and/or the density of clusters in DB. 

This observation can be used to rate the interestingness of subspaces. However,
simply counting all the core-objects for each subspace does not deliver enough
information. Even if two subspaces contain the same number of core-objects, the
quality may differ a lot. Dense regions contain objects which are not core-objects
but lie within the ε-neighborhood of a core-object and are thus a vital part of
the dense region. Therefore, it is not only interesting how many core-objects a
subspace contains but also how many objects lie within the ε-neighborhood of
these core-objects. In the following, the variable count[S] denotes the sum of all
points lying in the ε-neighborhood of all core-objects in the subspace S. The
number of core-objects of S is denoted by core[S]. If we measure the interestingness
of a subspace according to its count[S] value and rank all subspaces
according to this quality value, two problems are not addressed. First, since
with each added dimension the number of expected objects in the ε-neighborhood of
an object decreases, this naive quality value favors lower dimensional subspaces
over higher dimensional ones. To overcome this problem we introduce a scaling
coefficient that takes the dimensionality of the subspace into account. We take
the ratio between the count[S] value and the count[S] value we would get if all
data objects were uniformly distributed in S. For that purpose, we compute the
volume of a d-dimensional ε-neighborhood, denoted by $Vol_\varepsilon^d$, and the number of
objects lying in $Vol_\varepsilon^d$ assuming uniform distribution.
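
The two statistics core[S] and count[S] can be computed with a naive quadratic scan. This is only a sketch: it assumes the maximum norm, counts an object as a member of its own neighborhood, and reads count[S] as the sum of the neighborhood sizes of all core-objects.

```python
def core_and_count(db, subspace, eps, min_pts):
    """Return (core[S], count[S]) for the subspace given as a tuple of attribute indices."""
    def close(u, v):
        # Maximum norm restricted to the attributes of the subspace.
        return all(abs(u[i] - v[i]) <= eps for i in subspace)

    core, count = 0, 0
    for o in db:
        size = sum(1 for x in db if close(o, x))
        if size >= min_pts:  # o is a core-object in this subspace
            core += 1
            count += size
    return core, count


db = [(0.0, 0.0), (0.1, 9.0), (0.2, 3.0), (5.0, 6.0)]
```

In this toy data set three objects are close together on the first attribute only, so the subspace spanned by that attribute has core-objects while the full space has none.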



Definition 2. The quality of a subspace S, measuring the interestingness of S,
is defined by:

$$\mathrm{Quality}(S) = \frac{count[S]}{n \cdot \dfrac{Vol_\varepsilon^{dim[S]}}{attrRange^{dim[S]}} \cdot n}$$


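If dist is the $L_\infty$-norm, $Vol_\varepsilon^d$ is a hypercube and can be computed by $Vol_\varepsilon^d = (2\varepsilon)^d$,
or if dist is the Euclidean distance ($L_2$-norm), $Vol_\varepsilon^d$ is a hypersphere and
can be computed as given below:

$$Vol_\varepsilon^d = \frac{\sqrt{\pi^d}}{\Gamma(d/2 + 1)} \cdot \varepsilon^d$$

where $\Gamma(x + 1) = x \cdot \Gamma(x)$, $\Gamma(1) = 1$ and $\Gamma(1/2) = \sqrt{\pi}$.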
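
The scaling can be sketched numerically as follows. The code assumes that the quality is read as count[S] divided by the count expected under a uniform distribution, i.e. n objects each with n · Vol / attrRange^dim expected neighbors; the function and parameter names are illustrative.

```python
from math import gamma, pi, sqrt


def volume(eps, dim, norm="max"):
    """Volume of an eps-neighborhood: hypercube for 'max', hypersphere otherwise."""
    if norm == "max":
        return (2.0 * eps) ** dim
    return sqrt(pi ** dim) / gamma(dim / 2.0 + 1.0) * eps ** dim


def quality(count, n, dim, eps, attr_range, norm="max"):
    """count[S] relative to the count expected for uniformly distributed data."""
    expected_per_object = n * volume(eps, dim, norm) / attr_range ** dim
    return count / (n * expected_per_object)


q1 = quality(count=9, n=4, dim=1, eps=0.25, attr_range=10.0)
q2 = quality(count=9, n=4, dim=2, eps=0.25, attr_range=10.0)
```

For equal count values the two-dimensional subspace receives the higher quality (q2 > q1), because the same number of neighbors is far less likely under a uniform distribution in two dimensions than in one.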

The second problem is the phenomenon that in high-dimensional spaces
more and more points are located on the boundary of the data space. The
ε-neighborhoods of these objects are smaller because they exceed the borders of
the data space. In [11] the authors show that the average volume of the intersection
of the data space and a hypersphere with radius ε can be expressed as
the integral of a piecewise defined function integrated over all possible positions
of the ε-neighborhood, i.e. the core-objects. For our implementation we choose a
less complex heuristic to eliminate this effect based on periodical extensions of
the data space (cf. Section 3.2 for details).

For two arbitrary subspaces $U, V \subseteq A$ this quality criterion has two complementary
effects which are summarized in the following observation:



Observation 2. Let $U \supset V$. Then the following inequalities hold:

1. $core[U] \le core[V]$ and $count[U] \le count[V]$.

2. If $core[U] = core[V]$ and $count[U] = count[V]$,
then $\mathrm{Quality}(U) > \mathrm{Quality}(V)$.
Fig. 2. Visualisation of Lemma 1 for MinPts = 5 (2D feature space).



The first observation states that, while navigating through the subspaces 
bottom-up, at a certain point the core-objects lose their core-object property
due to the addition of irrelevant features and thus the quality decreases. On the 
other hand, as long as this is not the case, the features are relevant for the cluster 
and the quality increases. 



2.3 General Idea of Finding Interesting Subspaces 

A straightforward approach would be to examine all possible subspaces (e.g.
bottom-up). The problem is that the number of subspaces is $2^d$. Basically, all
subspaces that do not contain any core-object can be dropped since they cannot
contain any clusters. Furthermore, the core-object condition is monotonically
decreasing:



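Lemma 1. (Monotonicity of Core-Object Condition)

Let $o \in DB$ and $S \subseteq A$ be an attribute subset. If o is a core-object in S, then it
is also a core-object in any subspace $T \subseteq S$ wrt. ε and MinPts, formally:

$$\forall T \subseteq S: \; |\mathcal{N}_\varepsilon^S(o)| \ge MinPts \Rightarrow |\mathcal{N}_\varepsilon^T(o)| \ge MinPts.$$

Proof. $\forall x \in \mathcal{N}_\varepsilon^S(o)$ the following holds:

$$dist(\pi_S(o), \pi_S(x)) \le \varepsilon \;\Leftrightarrow\; \sqrt[p]{\sum_{a_i \in S} |\pi_{a_i}(o) - \pi_{a_i}(x)|^p} \le \varepsilon$$

$$\overset{T \subseteq S}{\Longrightarrow}\; \sqrt[p]{\sum_{a_i \in T} |\pi_{a_i}(o) - \pi_{a_i}(x)|^p} \le \varepsilon \;\Leftrightarrow\; dist(\pi_T(o), \pi_T(x)) \le \varepsilon \;\Rightarrow\; x \in \mathcal{N}_\varepsilon^T(o)$$

It follows that $|\mathcal{N}_\varepsilon^T(o)| \ge |\mathcal{N}_\varepsilon^S(o)| \ge MinPts$. □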

The Lemma is visualized in Figure 2(a). The reverse conclusion of Lemma 1 
is illustrated in Figure 2(b) and states: If an object o is not a core-object in T, 
then o is also not a core-object in any super-space $S \supseteq T$.
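
The monotonicity can also be checked empirically by brute force. The sketch below assumes the maximum norm and uniformly random data; it counts the cases in which an object is a core-object in a subspace S but not in one of its non-empty proper subsets T, which by Lemma 1 should never happen.

```python
import random
from itertools import combinations


def is_core(db, o, subspace, eps, min_pts):
    """Core-object test for object o in the given subspace (maximum norm)."""
    size = sum(1 for x in db
               if all(abs(o[i] - x[i]) <= eps for i in subspace))
    return size >= min_pts


def violations(db, d, eps, min_pts):
    """Count triples (o, S, T), T a non-empty proper subset of S, that contradict Lemma 1."""
    bad = 0
    for k in range(2, d + 1):
        for s in combinations(range(d), k):
            for o in db:
                if not is_core(db, o, s, eps, min_pts):
                    continue
                for j in range(1, k):
                    for t in combinations(s, j):
                        if not is_core(db, o, t, eps, min_pts):
                            bad += 1
    return bad


random.seed(0)
random_db = [tuple(random.random() for _ in range(4)) for _ in range(60)]
dense_db = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.2), (0.15, 0.1)]
```

Calling `violations(random_db, 4, 0.3, 5)` returns 0. The contrapositive is what makes bottom-up search efficient: once an object fails the test in T, no superset of T has to be examined for it.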

The next sections will present in detail, how this property helps to eliminate 
a lot of subspaces in the process of generating all relevant subspaces in a bottom- 
up process. 
RIS(SetOfObjects, Eps, MinPts)

  Subspaces := emptySet;

  FOR i FROM 1 TO SetOfObjects.size() DO
    Object := SetOfObjects.get(i);

    RelevantSubspaces := GenerateSubspaces(Object, SetOfObjects);
    Subspaces.add(RelevantSubspaces);

  END FOR

  Subspaces.prune();

  Subspaces.sort();

END //RIS



Fig. 3. The RIS algorithm. 



3 Implementation of RIS 

3.1 Algorithm 

Given a set of objects DB and density parameters ε and MinPts, RIS finds
all interesting subspaces and presents them to the user ordered by relevance. 
For each object, RIS computes a set of relevant subspaces. All these sets are 
then merged. A pruning and sorting procedure is applied to the resulting set of 
subspaces. The pseudocode of the algorithm RIS is given in Figure 3. For each 
object $o \in DB$, all subspaces in which the core-object condition holds for o are
computed. This step will be described in detail in Section 3.2. Let us note that the 
algorithm can also be applied to a sample of DB, e.g. for performance reasons 
(cf. Section 4.3). For each detected subspace, statistical data is accumulated. 
The detected subspaces are pruned according to certain criteria. In Section 3.3, 
these criteria will be discussed. Finally, the subspaces are sorted for a more 
comprehensible user presentation. The clustering in these subspaces can then be 
done by any clustering algorithm. 

3.2 Efficient Generation of Subspaces 

For a given object $o \in DB$, the method GenerateSubspaces finds all subspaces
S in which the core-object condition holds wrt. ε and MinPts. Formally, it computes
the following set: $K_o := \{T \subseteq A \mid |\mathcal{N}_\varepsilon^T(o)| \ge MinPts\}$.

The problem of finding the set $K_o$ is equivalent to the problem of determining
all frequent itemsets in the context of mining association rules [12] when using
the $L_\infty$-norm as distance function and thus can be computed rather efficiently¹:
For each $x \in DB$ a transaction $T_x \subseteq A$ is defined, such that

$$a_i \in T_x \Leftrightarrow |\pi_{a_i}(x) - \pi_{a_i}(o)| \le \varepsilon \quad \text{for all } i \in \{1, \ldots, d\}.$$



¹ Let us note that the use of the $L_\infty$-norm is no serious constraint. The only difference is
that by using the $L_\infty$-norm we may find additional core-objects and thus additional
subspaces. However, these additional subspaces get low quality values anyway.
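Lemma 2.

$$K_o = \left\{ T \subseteq A \;\middle|\; Supp_{DB}(T) \ge \frac{MinPts}{|DB|} \right\}, \quad \text{where } Supp_{DB}(T) = \frac{|\{x \in DB \mid T \subseteq T_x\}|}{|DB|}$$

Proof. $T \subseteq A \wedge |\mathcal{N}_\varepsilon^T(o)| \ge MinPts$

$\Leftrightarrow T \subseteq A \wedge |\{x \in DB \mid dist_{L_\infty}(\pi_T(o), \pi_T(x)) \le \varepsilon\}| \ge MinPts$

$\Leftrightarrow T \subseteq A \wedge |\{x \in DB \mid \forall i \in \{1, \ldots, d\}: a_i \in T \Rightarrow |\pi_{a_i}(o) - \pi_{a_i}(x)| \le \varepsilon\}| \ge MinPts$

$\Leftrightarrow T \subseteq A \wedge |\{x \in DB \mid T \subseteq T_x\}| \ge MinPts \Leftrightarrow T \subseteq A \wedge Supp_{DB}(T) \ge \frac{MinPts}{|DB|}$

□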
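
The reduction to frequent itemset mining can be sketched with a small level-wise search. This is an illustrative Apriori-style routine under the maximum norm, not the authors' implementation; it omits the accumulation of quality statistics and uses the absolute support threshold MinPts.

```python
from itertools import combinations


def generate_subspaces(db, o, eps, min_pts):
    """All non-empty attribute sets T in which o is a core-object (maximum norm)."""
    d = len(o)
    # Transaction of x: the attributes on which x lies within eps of o.
    transactions = [frozenset(i for i in range(d) if abs(x[i] - o[i]) <= eps)
                    for x in db]

    def support(t):
        return sum(1 for tx in transactions if t <= tx)

    level = [frozenset([i]) for i in range(d)
             if support(frozenset([i])) >= min_pts]
    result = []
    while level:
        result.extend(level)
        frequent = set(level)
        k = len(level[0]) + 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= min_pts]
    return result


db = [(0.0, 0.0, 0.0), (0.1, 0.1, 5.0), (0.2, 0.1, 9.0), (5.0, 0.2, 0.1)]
found = generate_subspaces(db, db[0], eps=0.25, min_pts=3)
```

For the first object of this toy data set the search returns the subspaces {0}, {1} and {0, 1}. Attribute 2 is dropped at the first level, so no subspace containing it is ever generated.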



The method GenerateSubspaces extends the familiar Apriori [12] algorithm
by accumulating the statistical information for measuring the subspace quality,
using the monotonicity of the core-object condition (cf. Lemma 1). As mentioned
before, we are extending the data space periodically to ensure that all
ε-neighborhoods have the same size. This can be done very easily by changing the
way the transactions are defined. Instead of only checking if $|\pi_{a_i}(x) - \pi_{a_i}(o)| \le \varepsilon$
we have to check if $|\pi_{a_i}(x) - \pi_{a_i}(o)| \le \varepsilon$ or $|\pi_{a_i}(x) - \pi_{a_i}(o)| \ge attrRange - \varepsilon$.
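
The periodic extension amounts to a one-line change of the per-attribute test. A minimal sketch, assuming attribute values in [0, attrRange] and an illustrative function name:

```python
def within_eps_periodic(a, b, eps, attr_range):
    """True if a and b are within eps on one attribute, wrapping around the range borders."""
    diff = abs(a - b)
    return diff <= eps or diff >= attr_range - eps
```

With attrRange = 1 and ε = 0.2, the values 0.05 and 0.95 count as neighbors because they are close across the border, while 0.05 and 0.5 do not.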



3.3 Pruning of Subspaces 

As we are only interested in the subspaces which provide the most information,
we can perform the following downward pruning step to eliminate redundant
subspaces: If there exists a (k+1)-dimensional subspace S with higher quality
than the k-dimensional subspace T ($S \supset T$), we delete T.
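
The downward pruning rule can be sketched as a filter over a table of quality values. The dictionary layout (frozenset of attribute indices mapped to a quality value) is an assumption made for illustration.

```python
def downward_prune(quality):
    """Drop every subspace T that has a superspace S with one more attribute and higher quality."""
    kept = {}
    for t, q_t in quality.items():
        dominated = any(len(s) == len(t) + 1 and t < s and q_s > q_t
                        for s, q_s in quality.items())
        if not dominated:
            kept[t] = q_t
    return kept


quality_values = {
    frozenset({0}): 1.0,
    frozenset({1}): 5.0,
    frozenset({0, 1}): 3.0,
}
survivors = downward_prune(quality_values)
```

In this example {0} is removed because {0, 1} has a higher quality, whereas {1} survives because its own quality exceeds that of {0, 1}.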

For the second pruning, we assume that for a given data set the k-dimensional
subspace S reflects the clustering in that special data set in the best possible
way. Thus, its quality value and the quality values of all its (k−1)-dimensional
subspaces $T_1, \ldots, T_m$ are high. On the other hand, if we combine one of these
(k−1)-dimensional subspaces $T_1, \ldots, T_m$ with another 1-dimensional subspace
with lower quality, the quality of the resulting k-dimensional subspace can still
be good. But as we know that it does not reflect the clustering in the best possible
way, we are not interested in this k-dimensional subspace. The following heuristic
upward pruning eliminates such subspaces. Let S be a k-dimensional attribute
space and $S^{k-1} := \{T \mid T \subset S \wedge dim[T] = k-1\}$ be the set of all (k−1)-dimensional
subspaces of S. Let $\overline{count}$ be the mean count value of all $T \in S^{k-1}$
and s be the standard deviation. Let $maxdiff := \max_{T \in S^{k-1}} |count[T] - \overline{count}|$ be
the maximum deviation of the count-values of all $T \in S^{k-1}$ from the mean count-value.
Then, the so-called bias-value can be computed as follows: $bias = s / maxdiff$.

If this bias-value falls below a certain threshold, we prune the k-dimensional
subspace S. Experimental evaluations indicate that 0.56 is a good value for this
bias-criterion.
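
The upward pruning test can be sketched as follows. Two points are assumptions rather than statements from the text: the bias-value is taken to be the ratio of the standard deviation to the maximum deviation from the mean, and the population standard deviation is used. The handling of identical counts is likewise an illustrative choice.

```python
from statistics import mean, pstdev


def bias(counts):
    """Ratio s / maxdiff over the count values of all (k-1)-dimensional subspaces of S."""
    m = mean(counts)
    max_diff = max(abs(c - m) for c in counts)
    if max_diff == 0:
        return float("inf")  # assumption: identical counts are never pruned
    return pstdev(counts) / max_diff


def prune_upward(counts, threshold=0.56):
    """True if the k-dimensional subspace should be pruned."""
    return bias(counts) < threshold


outlier = [10, 10, 10, 10, 50]  # one subspace deviates strongly from the rest
balanced = [10, 50]             # the deviations are evenly spread
```

Under these assumptions `outlier` yields a bias of 0.5 and is pruned, while `balanced` yields 1.0 and is kept. A single dominating lower-dimensional subspace therefore pulls the bias-value down.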
3.4 Determination of Density Parameters



A heuristic method, which is experimentally shown to be sufficient, suggests
MinPts ≈ ln(n) where n is the size of the database. Then, ε must be picked
depending on the value of MinPts. In [8] a simple heuristic is presented to
determine the ε of the “thinnest” cluster in the database (for a given MinPts).
But as we do not know beforehand in which subspaces clusters will be found, we
cannot determine ε to find a single subspace with one particular clustering. Quite
the contrary, we want to choose the parameters such that RIS detects subspaces
which might have clusters of different density and different dimensionality.

However, we can determine an upper bound for ε for a given value of MinPts.
If we take uniform distribution as worst case, the ε-neighborhood of an object
should not contain more than MinPts − 1 objects in the full-dimensional space.
Otherwise all objects are core-objects. In case of the $L_\infty$-norm an upper bound
for ε can be computed as follows:

$$\frac{Vol_\varepsilon^{dim}}{attrRange^{dim}} \cdot n < MinPts \;\Rightarrow\; \varepsilon < \frac{attrRange}{2} \cdot \sqrt[dim]{\frac{MinPts}{n}}$$




where dim = d. If we have any knowledge about the dimensionality of the 
subspaces we want to find, we can further decrease the upper bound by setting 
dim to the highest dimension of such a subspace. 

This upper bound is very rough. Nevertheless, it provides a good indication
for the choice of ε. Indeed, it empirically turned out that upperbound/4 is a
reasonable choice for ε. Experiments on synthetic data sets show that our suggested
criteria for the choice of the density parameters are sufficient to detect
the relevant subspaces containing clusters.
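
The parameter heuristics can be written down in a few lines. The sketch assumes the maximum norm, so that the neighborhood volume is (2ε)^dim, and rounds ln(n) to the nearest integer; the function names are illustrative.

```python
from math import log


def suggest_min_pts(n):
    """MinPts close to ln(n), at least 1."""
    return max(1, round(log(n)))


def eps_upper_bound(n, min_pts, dim, attr_range):
    """Largest eps for which a uniform data set has fewer than MinPts expected neighbors."""
    return attr_range / 2.0 * (min_pts / n) ** (1.0 / dim)


bound = eps_upper_bound(n=1000, min_pts=8, dim=3, attr_range=1.0)
```

For n = 1000, MinPts = 8, dim = 3 and attrRange = 1 the bound is 0.1. Inserting it back gives 1000 · (2 · 0.1)³ = 8 expected neighbors, exactly the limit at which every object of a uniform data set would become a core-object.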



4 Performance Evaluation 

We tested RIS using several synthetic data sets as well as a real-world data set. The
experiments were run on a workstation with a 1.7 GHz CPU and 2 GB RAM. 

The synthetic data sets were generated by a self-implemented data generator. 
It permits controlling the size and structure of the generated data sets through
parameters such as number and dimensionality of subspace clusters, dimension- 
ality of the feature space and density parameters for the whole data set as well 
as for each cluster. In a subspace that contains a cluster the average density of 
data points in that cluster is much larger than the density of points not belong- 
ing to the cluster in this subspace. In addition, it is ensured, that none of the 
synthetically generated data sets can be clustered in full dimensional space. 

The real world data set is the well-studied gene expression data set of Spell- 
man et al. [13] analyzing the yeast mitotic cell cycle. We only chose the data 
of the cdc15 mutant and eliminated all genes having missing attribute values.
The resulting test data set consists of approximately 4400 genes expressed at 24 
different time spots. 
A subsequent clustering of the data sets in the detected subspaces was performed
for each experiment using the above mentioned algorithm OPTICS to
validate the interestingness of the subspaces computed by RIS. 

4.1 Effectiveness Evaluation 

Synthetic Data Sets. We evaluated the effectiveness of RIS using several 
synthetic data sets of varying dimensionality. The data sets contained between 
two and five overlapping clusters in varying subspaces. In all experiments, RIS 
detected the correct subspaces in which clusters exist and assigned the highest 
quality values to them. All higher dimensional subspaces which were generated, 
were removed by the upward pruning procedure. 

Gene Expression Data. We also applied RIS to the above described gene
expression data set. A clustering using OPTICS in the two top-ranked subspaces
provided several clusters. The first subspace, spanned by the time spots 90, 110,
130, and 190, contains three biologically relevant clusters with several genes playing
a central role during mitosis². For example, cluster 1 consists of the genes
CDC25 (starting point for mitosis), MYO3 and NUD1 (known for an active
role during mitosis) and various other transcription factors (e.g. CHA4, ELP3)
necessary during the cell cycle. Cluster 2 contains the gene STE12, identified
by [13] as an important transcription factor for the regulation of the cell cycle.
In addition, the genes CDC27 and EMP47, which have possible STE12-sites
and are most likely co-regulated with STE12, are in that cluster. The cluster
is completed by several transcription factors (e.g. XBP1, SSL1). Cluster 3 also
consists of several genes which are known to play a role during the cell cycle
such as DOM34, CKA1, CPA1, and MIP6. The second subspace is spanned by
the time spots 190, 270 and 290 and consists of three clusters that have similar
characteristics to those of the first subspace. In addition, a fourth cluster
contains several mitochondrion-related genes which have similar functions and
are therefore most likely co-regulated. For example, the genes MRPL17,
MRPL31, MRPL32, and MRPL33 are four mitochondrial large ribosomal subunits,
the genes UBC1 and UBC4 are subunits of a certain protease, the genes
SNF7 and VPS4 are direct interaction partners, and several other genes
code for mitochondrial proteins (e.g. MEF1, PHB1, CYC1, MGE1, ATP12).
This indicates a higher mitochondrial activity at these time spots, which could
be explained by a higher demand of biological energy during the cell cycle (the
energy metabolism is located in mitochondria). In summary, RIS detects two
subspaces containing several biologically relevant co-regulations.

4.2 Efficiency Evaluation 

The results of the efficiency evaluation are depicted in Figure 4. This evaluation 
is based on several synthetic data sets. The experiments were run with MinPts =
ln(n) and ε chosen as suggested in Section 3.4. All run times are in seconds.

^ The analysis of the clusters is partly based on the Saccharomyces Genome Database 
(SGD), available at: http://genome-www.stanford.edu/Saccharomyces/ 

Fig. 4. Efficiency evaluation.

RIS scales well with the dimensionality of the relevant subspaces: as that
dimensionality increases, the runtime of RIS grows linearly. On the other hand,
the scalability of RIS with respect to the size n and the dimensionality d of the
input data set is not linear. With increasing n and d, the runtime of RIS grows
at least quadratically for rather large n and d, respectively. The reason for this
behavior with respect to n is that RIS performs multiple range queries without
any index support, since the ε-neighborhoods of all points in arbitrary subspaces
have to be computed and there is no index structure that efficiently supports
range queries in arbitrary subspaces. The observed scalability with respect to d
can be explained by the Apriori-like navigation through the search space of all
subspaces.

4.3 Speed-up for Large Data Sets 

Since the runtime of RIS is rather high, especially for large data sets, we applied
random sampling to accelerate our algorithm. Figure 4 shows that for a large
data set of n = 750,000 data objects, sampling yields a rather good speed-up.
The data set contained two overlapping four-dimensional subspace clusters,
containing approximately 400,000 and 350,000 points. Even using only 100 sample
points, RIS had no problem detecting the subspaces of these two clusters. For
all sample sizes, these subspaces had by far the highest quality values. Further
experiments show empirically that random sampling can be successfully applied
to RIS in order to speed up the algorithm with a minimal loss of quality.

5 Conclusions 

In this paper, we introduced a preprocessing step for clustering high-dimensional
data. Based on a quality criterion for the interestingness of a subspace, we
presented an efficient algorithm called RIS to compute all interesting subspaces
containing dense regions of arbitrary shape and size. Furthermore, the
well-established technique of random sampling can be applied to RIS in order to
speed up the algorithm significantly with a minimal loss of quality. The
effectiveness evaluation shows that RIS can be successfully applied to
high-dimensional real-world data, e.g. to gene expression data in order to find
co-regulated genes.

Abstract. The problem of efficiently finding patterns in massive time series
databases has attracted great interest, and, at least for the Euclidean distance
measure, may now be regarded as a solved problem. However, in recent years
there has been an increasing awareness that Euclidean distance is inappropriate
for many real world applications. The limitations of Euclidean distance stem
from the fact that it is very sensitive to distortions in the time axis. A partial
solution to this problem, Dynamic Time Warping (DTW), aligns the time axis
before calculating the Euclidean distance. However, DTW can only address the
problem of local scaling. As we demonstrate in this work, uniform scaling may
be just as important in many domains, including applications as diverse as
bioinformatics, space telemetry monitoring and motion editing for computer
animation. In this work, we demonstrate a novel technique to speed up similarity
search under uniform scaling. As we will demonstrate, our technique is simple
and intuitive, and can achieve a speedup of 2 to 3 orders of magnitude under
realistic settings.



1 Introduction 

The problem of efficiently finding patterns in massive time series databases has
attracted great interest in the database and data mining communities, and, at least for
the Euclidean distance measure, may now be regarded as a solved problem [2, 5, 11,
12]. However, in recent years there has been an increasing awareness that Euclidean
distance is inappropriate for many real world applications [1, 6]. The limitations of
Euclidean distance stem from the fact that it is very sensitive to distortions in the
time axis. A partial solution to this problem, Dynamic Time Warping (DTW),
essentially aligns the time axis before calculating the Euclidean distance. Because of its
well-documented lethargy, DTW was deemed impractical for large databases until a
recent breakthrough demonstrated that DTW can be indexed [10]. However, DTW can
only address the problem of local scaling, whereas uniform scaling may be just as
important in many domains, including applications as diverse as bioinformatics, space
telemetry monitoring and motion editing for computer animation.

There exist a handful of techniques that can support similarity search under
uniform scaling if the scaling factor is known in advance [3, 9]; however, in most
domains it is unlikely that we know the scaling factor. In such instances we must resort
to multiple queries, one for each possible scaling factor. Clearly, this is untenable for
even moderately large databases. What we really need is a technique that can perform
a single efficient query to retrieve all qualifying time series with any scaling. This is
exactly the contribution of this paper.

The rest of this paper is organized as follows. Section 2 carefully motivates the 
need for similarity search under uniform scaling, and reviews related work. In Section 
3 we introduce our approach to the problem. Section 4 contains an extensive empiri- 
cal evaluation on 5 real world datasets. Finally, Section 5 contains conclusions and 
directions for future work. 

2 Motivating the Need for Uniform Scaling 

In addition to the classic Euclidean and Dynamic Time Warping distance measures, 
the last decade has seen the introduction of dozens of new similarity measures for 
time series. Recent empirical studies, however, suggest that the majority of these 
measures are of dubious utility for real world problems [13]. We will therefore take 
the time to motivate the absolute need for uniform scaling in several real world appli- 
cations. 



2.1 Space Shuttle Telemetry Monitoring 

The Space Shuttle transmits thousands of sensor readings to Earth at Imhz or greater 
during flight. With over 100 missions, averaging 8.6 days in orbit, this massive
repository of data constitutes a potential goldmine for engineers wishing to understand
and predict in-flight anomalies [4]. Consider an engineer wishing to discover all
occurrences of a “dipping” event. This event consists of a sudden positive change in
yaw, followed by an auto correction by the Shuttle’s onboard flight guidance system.
Such events can easily be visually located in a small time series, as they form a ‘V’
pattern. However, in a massive dataset we must resort to a computerized similarity
search.

If we create a ‘V’ shaped query that is 4 minutes long, and search using the Euclidean
distance, we correctly find one true event as shown in Fig. 1A. However, the
second and third best matches fail to find the other two “dips”. In contrast, if we
issue a query for all ‘V’ shaped patterns in the range of 4 minutes to 6 minutes, we
can correctly discover all three such events as shown in Fig. 1B.



2.2 Gene Expression Data 

Recent advances in bioinformatics technology have resulted in an explosion of gene
expression data to be analyzed [1]. Several of the most important tasks, such as
clustering, classification and missing value reconstruction, require similarity matching as
a first step. Both Euclidean distance and DTW are used; however, we argue that
uniform scaling may be more useful for some tasks and datasets. Consider the two
sequences shown in Fig. 2.

Fig. 1. Eight hours of STS-57 Space Shuttle Inertial Sensor Data: A) A ‘V’ shaped query
correctly matches one steep valley in the data, but the second and third best matches fail to find the
two other valleys because they happen more slowly. B) A ‘V’ shaped query that is allowed to
rescale itself by up to 50% correctly finds the three valleys. The second and third best matches
have a scaling factor (sf) of 1.12 and 1.14 respectively





Fig. 2. Two yeast cell-cycle gene expression time series, from genes known to be functionally 
related. (Left) Using the original scale, the genes appear to be a poor match. (Right) If the 
shorter time series is rescaled by a scaling factor of 1.41, it becomes a high quality match to the 
“prefix” of the longer time series 



Although the two genes are known to be functionally related [1], the raw time series 
subjectively appear to be a poor match. Simply rescaling the shorter time series by a 
factor of 1.41 allows the underlying similarity to be more readily discovered. 

We considered other approaches for this problem. Euclidean distance is a very 
commonly used technique, but it is only defined for time series of the same length. 
One solution is to normalize the lengths with interpolation; another is to truncate the 
longer time series. Although DTW is defined for time series of different lengths, 
interpolation and truncation can also be useful here. In Fig. 3 we show all
combinations of these possibilities; none of them succeeds in capturing the underlying
similarity of the data.



2.3 Motion Capture Editing 

Motion capture data is increasingly used in video games, movie special effects and
gait analysis [6]. The following is a classic problem in this domain. Given two
examples of a human performing a task, once slowly, and once quickly, interpolate the
motion at any desired speed [20]. Figure 4 shows an example. The problem is
non-trivial because of non-linear effects in human dynamics. Nevertheless, correctly
aligning the two time series from each instance is a critical first step in solving the
problem. This can be achieved manually for a simple movie special effect, but for
real-time video games, or complex effect shots (e.g., the battle scenes in The Lord of the
Rings), automation is required.




Fig. 3. None of the published alternatives to uniform scaling produce intuitive alignments
between the two gene expression time series introduced in this section. Clockwise from the top
left: DTW after truncating the longer time series, classic DTW, DTW after length
normalization, Euclidean distance after length normalization




Fig. 4. (left) A computer animation of a boxer, driven by a motion capture system (center).
Given that we have captured an example of a fast movement and a slow movement (right), an
important problem in motion capture editing is to interpolate the movement at any desired
speed. Aligning the signals with uniform scaling is an important first step in this process

Having motivated the need for uniform scaling in several domains, we will next 
consider related work. 



2.4 Related Work 

The past decade has seen literally hundreds of papers on similarity search using the
Euclidean distance [2, 5, 11, 12]; useful surveys can be found in [8] and [17].
However, recent years have seen an increasing awareness that the Euclidean distance may
be unsuitable for many applications [1, 10, 18, 19].

Many non-Euclidean distance measures for time series have been introduced;
however, a recent empirical study suggests that most of them are of questionable utility
[10]. The only non-Euclidean distance measure that has been forcefully shown to be
superior to Euclidean distance is DTW; its utility has been demonstrated in domains
as diverse as bioinformatics [1], chemical engineering, gait analysis [6], speech
recognition, meteorology, and robotics. However, DTW only considers local stretching
and shrinking of the time axis. As we demonstrated in the previous section, uniform
scaling may be equally important in many domains.

The utility of uniform scaling has been noted before [9, 14, 15]. However, all
previous work has focused on speeding up similarity search when the scaling factor is
known. For example, there are systems that can index data of length 200, and support
queries of any length from 150 to 200. However, the user must specify what length
query they wish to run, perhaps a query of length 175. If the user wishes to find the
best matching time series, at any length from 150 to 200, they would have to run
every possible query, of length 150, 151, ..., 200, to find the answer. This is clearly
untenable. As all these systems claim about one order of magnitude speed-up, placing
them in a loop and running them 50 times is clearly going to be self-defeating. The
feature that differentiates our work from all the rest is that we allow a user to issue a
single query, and find the best match at any scaling. Our proposed technique is
unique in this aspect.



3 Uniform Scaling 

We begin by formally defining the uniform scaling problem. 

Suppose we have two time series, a query Q and a candidate match C, of length n 
and m respectively, where: 

$Q = q_1, q_2, \ldots, q_i, \ldots, q_n$ (1)

$C = c_1, c_2, \ldots, c_j, \ldots, c_m$ (2)

For clarity of presentation we will assume that n ≤ m, that is to say, C is always
longer than or equal to Q, and thus we are only interested in stretching the query to
match some prefix of C. This assumption is only to simplify notation and does not
preclude matching a time series by shrinking, since we can always reverse the roles of
the sequences.

If we wish to compare the two time series, and it happens that n = m, we can use 
the ubiquitous Euclidean distance: 

$D(Q,C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$ (3)

Since the square root function is monotonic and concave, we can remove the 
square root step and get identical rankings, clustering and classifications. This meas- 
ure is called the squared Euclidean distance: 

$D(Q,C) \equiv \sum_{i=1}^{n} (q_i - c_i)^2$ (4)

In addition to the utility of slightly speeding up the calculations, working with this 
distance measure makes other optimizations possible [13]. 

If n is smaller than m, then the distance measures introduced above are not
defined. To compare the two time series in this case, we have several choices: we can
truncate C, and compare Q to [c_1, c_2, ..., c_n]; or we can somehow stretch Q to be of
length m; or, more generally, we can stretch Q to be of length p (n ≤ p ≤ m), truncate
off the last m-p values of C, then use squared Euclidean distance. The informal idea
behind stretching can be captured in the more formal definition of scaling. To scale
time series Q to produce a new time series QP of length p, the formula is:

$QP_j = q_{\lceil j \cdot n/p \rceil}, \quad 1 \le j \le p$ (5)

Note that we can quickly obtain any scaling in O(p) time. We call the ratio p/n the
scaling factor or sf. Slightly different definitions of scaling do exist, but they do not
affect the results that follow. Fig. 5 visually summarizes the above definitions.
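
The scaling step can be illustrated with a minimal Python sketch. It assumes that equation (5) takes the ceiling-index form QP_j = q_⌈j·n/p⌉ with 1-based indices. The name `rescale` follows Table 1; the list representation and the integer ceiling trick are choices made only for this example.

```python
def rescale(q, p):
    """Stretch sequence q (length n) to length p, assuming eq. (5):
    QP_j = q_ceil(j*n/p) for 1 <= j <= p (1-based indices)."""
    n = len(q)
    if n == 0 or p < 1:
        raise ValueError("q must be non-empty and p must be at least 1")
    # Integer ceiling: ceil(j*n/p) == (j*n + p - 1) // p, which avoids
    # floating-point rounding; the trailing -1 converts to 0-based indexing.
    return [q[(j * n + p - 1) // p - 1] for j in range(1, p + 1)]


if __name__ == "__main__":
    # Scaling factor p/n = 6/4 = 1.5
    print(rescale([1, 2, 3, 4], 6))  # [1, 2, 2, 3, 4, 4]
```

Each output value is a single lookup, so the sketch runs in O(p) time, in line with the remark above.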

Fig. 5. A visual summary of the notation introduced in this section. From (left) to (right): A
candidate time series C, and a shorter query Q. The squared Euclidean distance between Q and
the first n datapoints in C can be visualized as the sum of the squared lengths of the gray hatch
lines. The query Q can be stretched to length p, producing a new time series QP. In this case,
QP is a good match to the first p datapoints in C



3.1 Brute Force Search under Uniform Scaling 

If we wish to find the best scaled match between Q and C, we can simply test all 
possible scalings, as illustrated in Table 1. 

Table 1. An algorithm to find the best scaled match between two time series 

Algorithm: Test_All_Scalings(Q, C)
  best_match_val = inf;
  best_scaling_factor = null;
  for p = n to m
    QP = rescale(Q, p);
    distance = squared_Euclidean_distance(QP, C[1..p]);
    if distance < best_match_val
      best_match_val = distance;
      best_scaling_factor = p/n;
    end;
  end;
  return (best_match_val, best_scaling_factor)
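
For readers who prefer executable code, the following is a minimal Python sketch of the brute force procedure in Table 1. It is an illustration under stated assumptions, not the implementation used in the experiments: the helper names `rescale`, `squared_euclidean` and `best_scaled_match` are chosen for this example, and the scaling rule assumes the ceiling-index form QP_j = q_⌈j·n/p⌉.

```python
def rescale(q, p):
    """Stretch q to length p, assuming QP_j = q_ceil(j*n/p) (1-based)."""
    n = len(q)
    return [q[(j * n + p - 1) // p - 1] for j in range(1, p + 1)]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_scaled_match(q, c):
    """Try every target length p = n..m and keep the best prefix match.

    Returns (best distance, best scaling factor p/n), as in Table 1.
    """
    n, m = len(q), len(c)
    if n == 0 or n > m:
        raise ValueError("requires 1 <= len(q) <= len(c)")
    best_val, best_sf = float("inf"), None
    for p in range(n, m + 1):
        d = squared_euclidean(rescale(q, p), c[:p])
        if d < best_val:
            best_val, best_sf = d, p / n
    return best_val, best_sf


if __name__ == "__main__":
    q = [0, 1, 0]
    c = [0, 0, 1, 1, 0, 0]
    # Stretching q to twice its length reproduces c exactly.
    print(best_scaled_match(q, c))  # (0, 2.0)
```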

The algorithm takes only O(p*(m-n)) time and seems unworthy of any optimization
effort. However, when mining real world datasets, rather than having a single
candidate time series C, we are typically confronted with a massive collection of possible
candidate time series, which we will denote as C. As a motivating example, the MACHO
dataset, a collection of star light curve microlensing events, has over 40 million time
series [7]. To find the best scaled match to a query Q in data collection C, we can use
a brute force algorithm as shown in Table 2.

Note that the time complexity for this algorithm is O(|C| * (m-n)); this is simply
untenable for large datasets.



Table 2. An algorithm to find the best scaled match to query from a set of possible matches 

Algorithm: Search_Database_for_Scaled_Match(Q, C)
  overall_best_time_series = null;
  overall_best_match_val = inf;
  overall_best_scaling = null;
  for i = 1 to number_of_time_series_in(C)
    [dist, scale] = Test_All_Scalings(Q, Ci);
    if dist < overall_best_match_val
      overall_best_time_series = i;
      overall_best_match_val = dist;
      overall_best_scaling = scale;
    end;
  end;
  return (overall_best_time_series, overall_best_match_val, overall_best_scaling)



3.2 Speeding up Search with Lower Bounding 

To speed up matching under uniform scaling we will rely on the classic idea of lower 
bounding. The intuition is this: given some technique for quickly calculating the 
minimum possible distance between the query and a candidate sequence at any possi- 
ble scaling, we can prune off many calculations. In more detail, we maintain a vari- 
able that contains the distance of the best-scaled match encountered thus far. Before 
calling the subroutine Test_All_Scalings on the next candidate time series, we 
first perform the quick lower bounding test. If the lower bound distance between the 
candidate and the query is greater than the distance of the best-scaled match already 
seen, we can simply discard the candidate from consideration. For clarity, the idea
is formalized in Table 3, although the algorithm differs from the algorithm in Table 2
only in the addition of the lower bounding test as a precondition to the subroutine
Test_All_Scalings.

There are only two important properties of a lower bounding measure: 

• It must be fast to compute. A measure that takes as long to compute as 
Test_All_Scalings is of little use. We would like the time complexity to be at 
most linear in the length of the time series. 

• It must be a relatively tight lower bound. A function can achieve a trivial lower 
bound by always returning zero as the lower bound estimate. However, in order 
for the algorithm in Table 3 to be effective, we require a method that tightly 
bounds the value of the best match. 

Table 3. A modified algorithm for searching for the best match under uniform scaling

Algorithm: Faster_Search_Database_for_Scaled_Match(Q, C)
  overall_best_time_series = null;
  overall_best_match_val = inf;
  overall_best_scaling = null;
  for i = 1 to number_of_time_series_in(C)
    if lower_bound_distance(Q, Ci) < overall_best_match_val
      [dist, scale] = Test_All_Scalings(Q, Ci);
      if dist < overall_best_match_val
        overall_best_time_series = i;
        overall_best_match_val = dist;
        overall_best_scaling = scale;
      end;
    end;
  end;
  return (overall_best_time_series, overall_best_match_val, overall_best_scaling)
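
The pruning pattern of Table 3 does not depend on the particular distance or bound, which the following Python sketch tries to make explicit. It is a generic illustration, not the authors' code: `pruned_search`, `exact_distance` and `lower_bound` are hypothetical names, and the toy demo uses plain squared Euclidean distance with a deliberately simple (but valid) lower bound so that the example stays self-contained. The counter of exact calls is an addition for illustration.

```python
def pruned_search(query, candidates, lower_bound, exact_distance):
    """Sequential scan with a lower bounding test, in the style of Table 3.

    lower_bound(query, c) must never exceed exact_distance(query, c);
    otherwise true best matches could be pruned by mistake.
    Returns (index of best candidate, its distance, number of exact calls).
    """
    best_idx, best_val = None, float("inf")
    exact_calls = 0
    for i, cand in enumerate(candidates):
        # Cheap test first: only pay for the exact distance when the
        # bound leaves room for an improvement over the best so far.
        if lower_bound(query, cand) < best_val:
            exact_calls += 1
            d = exact_distance(query, cand)
            if d < best_val:
                best_idx, best_val = i, d
    return best_idx, best_val, exact_calls


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def first_point_bound(a, b):
    # One term of the sum above, hence always <= the full distance.
    return (a[0] - b[0]) ** 2


if __name__ == "__main__":
    database = [[5, 5, 5], [1, 2, 3], [9, 9, 9], [1, 2, 4]]
    result = pruned_search([1, 2, 3], database,
                           first_point_bound, squared_euclidean)
    print(result)  # (1, 0, 2): two of the four exact computations were skipped
```

In the setting of this paper, `exact_distance` would play the role of Test_All_Scalings and `lower_bound` the role of the lower bounding measure.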



The idea of speeding up search using lower bounding is not new; in fact, it is the 
cornerstone of virtually every time series similarity search algorithm. However, while 
dozens of lower bounding measures are known for Euclidean distance [2, 5, 9, 11, 
12], and 3 lower bounding measures known for DTW [10], there are no lower bound- 
ing measures in the literature for uniform scaling. In the next section we introduce the 
first such measure. 



3.3 Lower Bounding Uniform Scalings 

To create a lower bounding distance measure for uniform scaling we will generate a 
bounding envelope. Bounding envelopes were introduced in [10] to lower bound 
DTW, and since then they have sparked a flurry of research activity [16, 18, 19]. 
While the principle is the same here, the definitions of the envelope are very differ- 
ent. In particular, we create two sequences U and L, such that: 

U, = max( c . ., c ) (6) 

L, = min( c , 1 , . . . , c ) (7) 

These sequences can be visualized as bounding the first n points of the time series C. 
Fig. 6. shows some examples. 

Fig. 6. (Left) A time series C of length 100. (Center) The time series shrouded by upper and
lower envelopes U and L with lengths 80. (Right) The same time series shrouded by upper and
lower envelopes U and L with lengths 60




Efficiently Finding Arbitrarily Scaled Patterns in Massive Time Series Databases 261 



Having defined U and L, we can now introduce the lower bounding function, which
was originally introduced in [10] for the problem of DTW.



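$LB\_Keogh(Q,C) = \sum_{i=1}^{n} \begin{cases} (q_i - U_i)^2 & \text{if } q_i > U_i \\ (q_i - L_i)^2 & \text{if } q_i < L_i \\ 0 & \text{otherwise} \end{cases}$ (8)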



This function can be visualized as the squared Euclidean distance between any part of 
the query time series not falling within the envelope and the nearest (orthogonal) 
corresponding section of the envelope. Fig. 7 illustrates the idea.





Fig. 7. (Left) A time series C and a shorter query Q. (Right) A visualization of the lower- 
bounding function LB_Keogh(Q,C). Note that any part of query time series Q that falls inside 
the bounding envelope is ignored. Otherwise the distance corresponds to the sum of the squared 
straight line distances from the query to the nearest point in the envelope (the gray hatch lines) 

We claim that LB_Keogh(Q,C) lower bounds the squared Euclidean distance
between any scaling of Q and the appropriate prefix of C. The proof is
straightforward; we omit it for brevity.
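
The envelope and the bound can be sketched in Python as follows. This is a hedged illustration rather than the paper's exact construction. As an assumption, position i of the envelope covers c_i, ..., c_⌊i·m/n⌋ (1-based), a conservative window that contains every point q_i can be aligned with when Q is stretched to any length p between n and m under the ceiling-index scaling rule; the exact index range used in equations (6) and (7) may differ. The names `envelope` and `lb_keogh` are chosen for this example.

```python
def envelope(c, n):
    """Upper and lower envelopes U, L of length n for candidate c (length m >= n).

    Assumption: position i (1-based) covers c_i .. c_floor(i*m/n), i.e. every
    point that q_i can be aligned with for some stretch length p in [n, m].
    """
    m = len(c)
    if n < 1 or n > m:
        raise ValueError("requires 1 <= n <= len(c)")
    upper, lower = [], []
    for i in range(1, n + 1):
        window = c[i - 1:(i * m) // n]  # 0-based slice of the 1-based range
        upper.append(max(window))
        lower.append(min(window))
    return upper, lower


def lb_keogh(q, upper, lower):
    """Sum of squared distances from q to the envelope wherever q leaves it."""
    total = 0
    for qi, u, l in zip(q, upper, lower):
        if qi > u:
            total += (qi - u) ** 2
        elif qi < l:
            total += (qi - l) ** 2
        # Points inside the envelope contribute nothing.
    return total


if __name__ == "__main__":
    c = [0, 0, 1, 1, 0, 0]
    U, L = envelope(c, 3)
    print(U, L)                      # [0, 1, 1] [0, 0, 0]
    print(lb_keogh([0, 1, 0], U, L))  # 0
    print(lb_keogh([3, 1, 0], U, L))  # 9
```

For the query [3, 1, 0], the best distance over all stretch lengths 3 to 6 is 11, so the value 9 is indeed a lower bound in this small case. Both functions run in time linear in the length of the candidate.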



3.4 Further Optimizations 

While LB_Keogh(Q,C) is the optimal lower bound for uniform scaling, given only U 
and L, several further optimizations are possible in the context of similarity search. 
We will give one such example here, using concrete numbers for clarity. Suppose we 
are using the algorithm in Table 3 for similarity search, with n = 100, and m = 200. 
Further suppose that the best matching time series encountered thus far is at a
distance of 10. If we test the lower bound of the next candidate time series and we find it
to be 11, we can prune it from the search space. However, if the lower bound is 9 we
must call the Test_All_Scalings subroutine.

We can observe, however, that although the lower bounding test did fail for the
fairly drastic scaling factor of 2 (i.e. 200/100), it would be less likely to do so for a
smaller scaling factor, say 3/2. We could rescale the query to length 150, rebuild U
and L and apply the lower bounding test again. If it happens that the lower bound is
now 10 or greater, we could prune all possible scalings from length 150 to 200 from
consideration, and only examine the scalings from 100 to 149. Of course, we could
apply the above logic recursively to the scalings from 100 to 149, and more generally
this suggests doing a binary search over all the scalings. We call this algorithm
Binary_Test_All_Scalings, but omit a detailed description since it is rather
obvious. Note that we cannot use binary search to speed up the brute force algorithm,
since the squared Euclidean distance does not vary monotonically with the scaling
factor (in general). We use this optimization in all our experiments below.
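
Since the details of Binary_Test_All_Scalings are omitted, the following Python sketch shows one possible reading of the idea; it should not be taken as the authors' algorithm. Assumptions: scaling uses the ceiling-index rule QP_j = q_⌈j·n/p⌉; the bound for a range of target lengths [lo, hi] is computed directly from the original query, using for each q_i the window of candidate points it can be aligned with for some p in that range; and the range is halved whenever the bound fails to prune it. All function names are hypothetical.

```python
def rescale(q, p):
    n = len(q)
    return [q[(j * n + p - 1) // p - 1] for j in range(1, p + 1)]


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def range_lower_bound(q, c, lo, hi):
    """Lower bound on the distance of every stretch length p in [lo, hi].

    For such p, q_i is aligned only with c_j where
    floor((i-1)*lo/n) < j <= floor(i*hi/n) (1-based j).
    """
    n = len(q)
    total = 0
    for i in range(1, n + 1):
        window = c[((i - 1) * lo) // n:(i * hi) // n]
        u, l = max(window), min(window)
        if q[i - 1] > u:
            total += (q[i - 1] - u) ** 2
        elif q[i - 1] < l:
            total += (q[i - 1] - l) ** 2
    return total


def binary_best_scaled_match(q, c, best_so_far=float("inf")):
    """Best scaled match of q against prefixes of c, pruning whole ranges."""
    n, m = len(q), len(c)
    best_val, best_sf = best_so_far, None
    stack = [(n, m)]
    while stack:
        lo, hi = stack.pop()
        if range_lower_bound(q, c, lo, hi) >= best_val:
            continue  # no p in [lo, hi] can improve on the best so far
        if lo == hi:
            d = squared_euclidean(rescale(q, lo), c[:lo])
            if d < best_val:
                best_val, best_sf = d, lo / n
        else:
            mid = (lo + hi) // 2
            stack.append((mid + 1, hi))
            stack.append((lo, mid))
    return best_val, best_sf


if __name__ == "__main__":
    print(binary_best_scaled_match([0, 1, 0], [0, 0, 1, 1, 0, 0]))  # (0, 2.0)
```

Passing the best distance found so far as `best_so_far` lets the routine skip a candidate entirely when even the full range [n, m] cannot beat it, which corresponds to the outer test in Table 3.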



4 Experimental Results 



In this section we test our proposed approach with a comprehensive set of experi- 
ments. We compare only to the brute force search algorithm defined in Table 2, be- 
cause there are no other techniques in existence that support uniform scaling queries, 
with a single query. To eliminate the possibility of implementation bias [13], we will 
report the Pruning Power, the fraction of times that our approach must call the 
squared Euclidean distance function. 



$\text{Pruning Power} = \frac{\text{Number of calls to distance function by proposed approach}}{\text{Number of calls to distance function by brute force search}}$ (13)



This measure depends only on the tightness of the lower bounds, and is independent 
of language, platform, caching or any other implementation details. As an additional 
sanity check we also measured the CPU time, however since it is almost perfectly 
correlated with the Pruning Power, we will omit it for brevity. 

It has been forcefully demonstrated that the quality of lower bounding measures, 
and therefore the speed of search, can vary greatly depending on the data [13]. We 
therefore tested our approach on a variety of datasets. Fig. 8 shows a sample of each.




Fig. 8. Randomly extracted samples of the time series datasets 



Since the speed-up obtained for our approach clearly depends on the range of scaling
factors and the length of the time series, we will test our approach for the cross
product of scaling factors = {1.05, 1.10, 1.15, 1.20, 1.25} and time series candidate
lengths of {16, 32, 64, 128, 256}.

We conducted our experiments as follows. We randomly removed a subsequence
of the appropriate length from the data to use as a query, then we randomly chose
5,000 other subsequences to act as the database. We then searched for the best scaled
match, noting the pruning power. We repeated this 100 times for every combination
of scaling factors and candidate lengths. Fig. 9 shows the results.

The results are quite impressive: the worst case is a single order of magnitude
speed-up, and more generally two to three orders of magnitude speed-up are observed.
Note that the pruning power seems independent of the candidate time series lengths,
but does get worse as the scaling factor increases. This is to be expected, since for
large scaling factors the LB_Keogh function has relatively little information with
which to calculate the lower bound.

Fig. 9. The pruning power of LB_Keogh on 5 different datasets, over a range of scaling factors
and candidate lengths




Fig. 10. The pruning power of LB_Keogh on the burst dataset, over a range of scaling factors
and database sizes. Note the scale of the Z-axis is different from that of Fig. 9

As with many indexing techniques, the pruning power of our approach improves 
with the size of the dataset. The intuition behind this effect is that the larger the data- 
set, the more likely we are to find a very close match early on in the search, and thus 
derive the maximum benefit from the lower bound pruning test (the outermost if 
statement in Table 3). To demonstrate this, we repeated the previous experiment for 
different size datasets. The results for just the burst dataset are shown in Fig. 10. 

The results clearly show that as the database size increases, the pruning power im- 
proves. This is a very desirable property when mining larger datasets. 



5 Discussion and Conclusions 

We have shown how to dramatically speed up similarity search under uniform
warping; however, we have not considered indexing under uniform warping. Fortunately,
the ability to index the data comes for free! A technique for indexing envelopes under
LB_Keogh was introduced in [10]. Since then, many other researchers have used this
technique and suggested extensions [16, 18, 19]. (Note that paper [19] claims to
introduce the “concept of envelopes”; “introduce” must be a typo for “review”, since
envelopes were introduced in [10].) This explosion of interest has ensured that indexing of
time series envelopes has become a mature technology in only one year. We omitted
empirical testing of indexing for brevity and clarity; we simply note that it works
exceptionally well. We leave a full discussion for future work.

Abstract. Many organizations and companies have to answer large
amounts of emails. Often, most of these emails contain variations of 
relatively few frequently asked questions. We address the problem of 
predicting which of several frequently used answers a user will choose to 
respond to an email. Our approach effectively utilizes the data that is 
typically available in this setting: inbound and outbound emails stored 
on a server. We take into account that there are no explicit links be- 
tween inbound and corresponding outbound mails on the server. We 
map the problem to a semi-supervised classification problem that can be 
addressed by algorithms such as the transductive support vector machine 
and multi-view learning. We evaluate our approach using emails sent to 
a corporate customer service department. 



1 Introduction 

Companies allocate considerable economic resources to communication with 
their customers. A continuously increasing share of this communication takes 
place via email; marketing, sales and customer service departments as well as 
dedicated call centers have to process high volumes of emails, many of them con- 
taining repetitive routine questions. It appears overly ambitious to completely 
automate this process; however, any software support that leads to a significant 
productivity increase is already greatly beneficial. Our approach to support this 
process is to predict which answer a user will most likely send in reply to an 
incoming email, and to propose this answer to the user. The user, however, is 
free to modify - or to dismiss - the proposed answer. 

Our approach is to learn a predictor that decides which of a small set of stan- 
dard answers a user is most likely to choose in reply to a given inbound message. 
We learn such a predictor from the available data: inbound and outbound emails 
stored on an email server. We transform the email answering problem into a set of 
semi-supervised text classification problems. In contrast to studies that investigate
the identification of general subject areas of emails (e.g., [8]), we explore whether
text classification algorithms can identify instances of a specific frequently asked
question.
Many approaches are known that learn text classifiers from data. The support
vector machine (SVM) (e.g., [10]) is generally considered to be one of the
most accurate algorithms; this is supported, for instance, by the TREC filtering 
challenge [13]. The naive Bayes algorithm is also widely used for text classifica- 
tion. 

Among the known algorithms that utilize unlabeled data, the transductive
SVM and the multi-view framework are applicable to support vector learning. The
transductive SVM [9] maximizes the distance between the hyperplane and both labeled
and unlabeled data. In multi-view learning [3], two classifiers which use different
attribute sets provide each other with labels for the unlabeled data.

The contribution of this paper is threefold. Firstly, we analyze the problem of 
answering emails, taking all practical aspects into account. Secondly, we present 
a case study on a practically relevant problem showing how well the naive Bayes 
algorithm, the support vector machine, the transductive support vector machine, 
and the co-training multi-view algorithm can identify instances of particular 
questions in emails. Thirdly, we describe how we integrated machine learning 
algorithms into a practical answering assistance system that is easy to use and 
provides immediate user benefit. 

The rest of this paper is organized as follows. In Section 2, we analyze the 
problem setting. We discuss our general approach and our mapping of the email 
answering problem to a set of semi-supervised text classification problems in 
Section 3. In Section 4, we briefly describe the transductive SVM and the multi- 
view algorithm that we used for the case study that is presented in Section 5. 
In Section 6 we describe how we have integrated our learning approach into the 
Responsio email management system. Section 7 discusses related approaches. 

2 Problem Setting 

We consider the problem of predicting which of $n$ (manually identified) standard
answers $A_1, \ldots, A_n$ a user will send in reply to an email. In order to learn a
predictor, we are given a repository $\{x_1, \ldots, x_m\}$ of inbound and a repository
$\{y_1, \ldots, y_{m'}\}$ of outbound emails. Typically, these repositories contain at
least hundreds, and often thousands, of emails stored on a corporate email server.

Although both inbound and outbound emails are stored, it is not trivial to 
identify which outbound email has been sent in reply to a particular inbound 
email; neither the emails nor the internal data structures of the Outlook email 
client contain explicit links. When an outbound email does not exactly match one 
of the standard answers, this does not necessarily mean that none of the standard 
answers is the correct prediction. The user could have written an answer that is 
equivalent to one of the answers Ai but uses a few different words. 

A characteristic property of the email answering domain is the non-stationarity
of the distribution of inbound emails. While the likelihood $P(x \mid A_i)$ is quite
stable over time, the prior probability $P(A_i)$ is not. Consider, for example, a
server breakdown which will lead to a sharp increase in the probability of an
answer like “we apologize for experiencing technical problems...”; or consider
an advertising campaign for a new product which will lead to a high volume of
requests for information on that product.

What is the appropriate utility criterion for this problem? Our goal is to
assist the user by proposing answers to emails. Whenever we propose the answer
that the user accepts, he or she benefits; whereas, when we propose a different
answer, the user has to manually select or write an answer. Hence, the optimal
predictor proposes the answer $A_i$ which is most likely given $x$ (i.e., maximizes
$P(A_i \mid x)$), and thereby minimizes the probability that the user has to
write an answer manually. Keeping these characteristics in mind, we can pose
the problem which we want to solve as follows.

Problem 1. Given is a repository $X$ of inbound emails and a repository $Y$ of
outbound emails in which instances of standard answers $A_1, \ldots, A_n$ occur. There is
no explicit mapping between inbound and outbound mails and the prior
probabilities $P(A_i)$ are non-stationary. The task is to generate a predictor for the
most likely answer $A_i$ to a new inbound email $x$.



3 Underlying Learning Problem 



In this section, we discuss our general approach that reduces the email answering
problem to a semi-supervised text classification problem. 

Firstly, we have to deal with the non-stationarity of the prior $P(A_i)$. In
order to predict the answer that is most likely given $x$, we have to choose
$\arg\max_i P(A_i \mid x) = \arg\max_i P(x \mid A_i) P(A_i)$, where $P(x \mid A_i)$ is the
likelihood of question $x$ given that it will be answered with $A_i$ and $P(A_i)$ is the
prior probability of answer $A_i$. Assuming that the answer will be exactly one of
$A_1, \ldots, A_n$, we have $\sum_i P(A_i) = 1$; when the answer can be any subset of
$\{A_1, \ldots, A_n\}$, then $P(A_i) + P(\bar{A}_i) = 1$ for each answer $A_i$.

We know that the likelihood $P(x \mid A_i)$ is stationary; only a small number
of probabilities $P(A_i)$ has to be estimated dynamically. Equation 1 averages
the time dependent priors (estimated by counting occurrences of the $A_i$ in the
outbound emails within time interval $t$) discounted over time.



$$P(A_i) = \frac{\sum_{t=0}^{T} e^{-\lambda t} \, P(A_i \mid t)}{\sum_{t=0}^{T} e^{-\lambda t}} \qquad (1)$$
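The discounted averaging of time-dependent priors can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that interval t = 0 is the most recent one, that the decay rate is chosen by hand, and that each interval is summarized by a count of standard answers found in the outbound mail. The names `discounted_prior`, `counts_per_interval` and `lam` are hypothetical.

```python
import math


def discounted_prior(counts_per_interval, answer, lam=0.5):
    """Exponentially discounted average of per-interval priors.

    counts_per_interval[t] maps each standard answer to the number of times
    it occurred in the outbound mail of interval t (assumption: t = 0 is the
    most recent interval, larger t lies further in the past).
    """
    numerator = 0.0
    denominator = 0.0
    for t, counts in enumerate(counts_per_interval):
        total = sum(counts.values())
        if total == 0:
            continue  # no standard answers observed in this interval; skip it
        weight = math.exp(-lam * t)  # older intervals get smaller weights
        numerator += weight * counts.get(answer, 0) / total
        denominator += weight
    return numerator / denominator if denominator > 0 else 0.0
```

With a history such as `[{"server down": 8, "product inquiries": 2}, {"server down": 1, "product inquiries": 9}]`, a larger `lam` moves the estimate for "server down" toward its recent share of 0.8, while `lam = 0` gives the plain average over both intervals.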
We can now focus on estimating the (stationary) likelihood $P(x \mid A_i)$ from the
data. In order to map the email answering problem to a classification problem,
we have to identify positive and negative examples for each answer $A_i$.

We use the following heuristic to identify cases where an outbound email is a 
response to a particular inbound mail. The recipient has to match the sender of 
the inbound mail, and the subject lines have to match up to a prefix (“Re:” for 
English or “AW:” for German email clients). Furthermore, either the inbound 
mail has to be quoted in the outbound mail, or the outbound mail has to be sent 
while the inbound mail was visible in one of the active windows. (We are able 
to check the latter condition because our email assistance system is integrated 
into the Outlook email client and monitors user activity.) Using this rule, we are
able to identify some inbound emails as positive examples for $A_1, \ldots, A_n$.
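The matching rule above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `Email` record and the function names are hypothetical, quoting is approximated by a plain substring test, and the "visible in an active window" condition is passed in as a flag because it depends on monitoring the mail client.

```python
from dataclasses import dataclass

REPLY_PREFIXES = ("re:", "aw:")  # English and German reply prefixes


@dataclass
class Email:
    sender: str
    recipient: str
    subject: str
    body: str


def strip_reply_prefix(subject):
    """Remove leading reply prefixes such as "Re:" or "AW:"."""
    s = subject.strip()
    while s.lower().startswith(REPLY_PREFIXES):
        s = s[3:].strip()  # both prefixes are three characters long
    return s


def is_reply(inbound, outbound, inbound_was_visible=False):
    """Recipient matches sender, subjects match up to a reply prefix, and the
    inbound mail is quoted or was visible while the reply was sent."""
    if outbound.recipient.lower() != inbound.sender.lower():
        return False
    if (strip_reply_prefix(outbound.subject).lower()
            != strip_reply_prefix(inbound.subject).lower()):
        return False
    text = inbound.body.strip()
    quoted = text != "" and text in outbound.body  # naive quoting test
    return quoted or inbound_was_visible
```

A real mail client usually prefixes quoted lines with ">" or similar markers, so a production version of the quoting test would have to normalize the quoted text first.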

We also need to identify negative examples. We can safely assume that no 
two different standard answers Ai and Aj are semantically equivalent. Hence, 
when an email has been answered by Ai, we can conclude that it is a negative 
example for all $A_j$, $j \neq i$. When the answer to an inbound email is different
from all standard answers Ai, we cannot conclude that the inbound mail is a 
negative example for all standard answers because the response might have been 
semantically equivalent, or very similar, to one of the standard answers. Such 
emails are unlabeled examples in the resulting text classification problem. 

For the same reason, we cannot obtain examples of inbound emails for which
no standard answer is appropriate; hence, we cannot estimate P(no standard
answer) or $P(\bar{A}_i)$ for any $A_i$. Thus, we have a small set of positive and negative
examples for each $A_i$. Additionally, we have a large quantity of emails for which
we cannot determine the appropriate answer.
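The resulting split into positive, negative and unlabeled examples can be sketched as below. The function name `label_inbound` is hypothetical, and the test for "the outbound mail is an instance of a standard answer" is assumed here to be verbatim containment of the answer text; the paper does not spell out that test.

```python
def label_inbound(replies, standard_answers):
    """Build one binary data set per standard answer.

    replies: list of (inbound_text, outbound_text) pairs that were matched
    by the reply heuristic.
    standard_answers: dict mapping an answer id to its answer text.
    Returns ({answer_id: {"pos": [...], "neg": [...]}}, unlabeled_inbound).
    """
    data = {a: {"pos": [], "neg": []} for a in standard_answers}
    unlabeled = []
    for inbound, outbound in replies:
        matches = [a for a, text in standard_answers.items() if text in outbound]
        if len(matches) != 1:
            # Individual (or ambiguous) reply: it may still be equivalent to a
            # standard answer, so the inbound mail stays unlabeled.
            unlabeled.append(inbound)
            continue
        for a in standard_answers:
            # Positive for the answer that was sent, negative for all others.
            data[a]["pos" if a == matches[0] else "neg"].append(inbound)
    return data, unlabeled
```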

Text classifiers typically return an uncalibrated decision function $f_i$ for each
binary classification problem; our decision on the answer to $x$ has to be based on
the $f_i(x)$ (Equation 2). We discriminate each class $A_i$ against all other classes;
that is, we have to assume that $A_i$ is independent of all $f_j(x)$ for $i \neq j$. Since we
have dynamic estimates of the non-stationary $P(A_i)$, Bayes’ equation (Equation
3) provides us with a mechanism that combines $n$ binary decision functions and
the prior estimates optimally.



$$\arg\max_i P(A_i \mid x) = \arg\max_i P(A_i \mid f_1(x), \ldots, f_n(x)) \qquad (2)$$

$$\approx \arg\max_i P(A_i \mid f_i(x)) = \arg\max_i P(f_i(x) \mid A_i) \, P(A_i) \qquad (3)$$



Equation 3 is only applicable for discrete $f_i(x)$, while the decision function
values are really continuous. In order to estimate $P(f_i(x) \mid A_i)$ we have to fit a
parametric model to the data. Following [2], we assume Gaussian likelihoods
$P(f_i(x) \mid A_i)$ and estimate the parameters $\mu_i$, $\bar{\mu}_i$ and $\sigma_i$ in a cross
validation loop as follows. In each cross validation fold, we record the $f_i(x)$ for
all held-out positive and negative instances. After that, we estimate $\mu_i$, $\bar{\mu}_i$ and
$\sigma_i$ from the recorded decision function values of all examples. It is well known
that Bayes’ rule applied to a Gaussian likelihood yields a sigmoidal posterior;
Equation 4 corresponds to Equation 3 for continuous $f_i(x)$ and Gaussian $P(f_i(x) \mid A_i)$.







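$$P(A_i \mid f_i(x)) = \frac{1}{1 + \exp\left( \frac{\bar{\mu}_i - \mu_i}{\sigma_i^2} f_i(x) + \frac{\mu_i^2 - \bar{\mu}_i^2}{2 \sigma_i^2} + \log \frac{1 - P(A_i)}{P(A_i)} \right)} \qquad (4)$$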
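The statement that Bayes' rule with Gaussian likelihoods gives a sigmoidal posterior can be checked numerically. The sketch below assumes one mean for the decision values of positive instances (`mu_pos`), one for negative instances (`mu_neg`) and a standard deviation `sigma` shared by both classes; these parameter names are illustrative. It computes the posterior once directly from Bayes' rule and once as a sigmoid of the decision value.

```python
import math


def gaussian_pdf(z, mean, sigma):
    """Density of a normal distribution with the given mean and std. dev."""
    return math.exp(-((z - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def posterior_bayes(z, mu_pos, mu_neg, sigma, prior):
    """P(A_i | f_i(x) = z) by applying Bayes' rule directly."""
    pos = gaussian_pdf(z, mu_pos, sigma) * prior
    neg = gaussian_pdf(z, mu_neg, sigma) * (1.0 - prior)
    return pos / (pos + neg)


def posterior_sigmoid(z, mu_pos, mu_neg, sigma, prior):
    """The same posterior written as a sigmoid of the decision value z."""
    exponent = (((mu_neg - mu_pos) / sigma ** 2) * z
                + (mu_pos ** 2 - mu_neg ** 2) / (2 * sigma ** 2)
                + math.log((1.0 - prior) / prior))
    return 1.0 / (1.0 + math.exp(exponent))
```

The prior enters only through the additive log-odds term, which is why a dynamically re-estimated prior can be combined with a decision function that was trained once.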



We have now reduced the email answering problem to a semi-supervised text 
classification problem. We have n binary classification problems for which few 
labeled positive and negative and many unlabeled examples are available. We 
need a text classifier that returns a (possibly uncalibrated) decision function
$f_i : X \to \mathbb{R}$ for each of the answers $A_i$.

We considered a Naive Bayes classifier and the support vector machine
[9]. Both classifiers use the bag-of-words representation which
considers only the words occurring in a document, but not the word order. As a
preprocessing operation, we tokenize the documents but do not apply a stemmer.
For the SVM, we calculate tf.idf vectors.
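A bag-of-words representation with tf.idf weights can be sketched with the standard library alone. The weighting below (raw term frequency times the logarithm of the inverse document frequency) is one common variant and is an assumption; the paper does not state which variant was used. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens; no stemming is applied."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse {term: weight} vector per document.

    weight = term frequency * log(number of documents / document frequency)
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors
```

Word order is discarded by construction, and a term that occurs in every document receives weight zero.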

4 Using Unlabeled Data 

We briefly sketch two approaches that make it possible to utilize unlabeled data
for support vector learning: the transductive SVM, and the co-training algorithm.



4.1 Transduction 

In order to calculate the decision function for an instance $x$, the support vector
machine calculates a linear function $f(x) = wx + b$. Model parameters $w$ and $b$
are learned from data $((x_1, y_1), \ldots, (x_m, y_m))$. Note that $\frac{w}{|w|} x_i + b$ is the
distance between plane $(w, b)$ and instance $x_i$; this margin is positive for positive
examples ($y_i = +1$) and negative for negative examples ($y_i = -1$). Equivalently,
$y_i(\frac{w}{|w|} x_i + b)$ is the positive margin for both positive and negative examples.

The optimization problem which the SVM learning procedure solves is to
find $w$ and $b$ such that $y_i(w x_i + b)$ is positive for all examples (all instances lie
on the “correct” side of the plane) and the smallest margin (over all examples)
is maximized. Equivalently to maximizing $y_i(\frac{w}{|w|} x_i + b)$, it is usually demanded
that $y_i(w x_i + b) \geq 1$ for all $(x_i, y_i)$ and that $|w|$ be minimized.

Optimization Problem 1 Given data $((x_1, y_1), \ldots, (x_m, y_m))$; over all $w, b$,
minimize $|w|^2$, subject to the constraints $\forall_{i=1}^{m} \; y_i(w x_i + b) \geq 1$.
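The link between the constraints and the margin can be made concrete with a small sketch. It uses the usual normalization of the whole decision value by |w| when computing distances; the function names are hypothetical. Rescaling (w, b) leaves the geometric margins unchanged, so among all rescalings that satisfy the constraints, the one with the smallest |w| has the largest guaranteed margin 1/|w|.

```python
import math


def decision_value(w, b, x):
    """Linear decision function f(x) = w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def geometric_margins(w, b, examples):
    """Signed distances y_i * (w . x_i + b) / |w| of the labeled examples."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return [y * decision_value(w, b, x) / norm for x, y in examples]


def satisfies_constraints(w, b, examples):
    """True if y_i * (w . x_i + b) >= 1 holds for every example."""
    return all(y * decision_value(w, b, x) >= 1 for x, y in examples)
```

For the two points (2, 0) with label +1 and (0, 0) with label -1, both w = (1, 0), b = -1 and w = (2, 0), b = -2 satisfy the constraints and describe the same plane; the first has the smaller norm and is therefore the solution the optimization problem prefers.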

The software package [9] implements an efficient optimization algorithm
which solves optimization problem 1. The transductive support vector
machine (TSVM) [10] furthermore considers unlabeled data. This unlabeled data
can (but need not) be new instances which the SVM is to classify. In transductive
support vector learning, the optimization problem is reformulated such that
the margin between all (labeled and unlabeled) examples and the hyperplane is
maximized. However, only for the labeled examples do we know on which side of the
hyperplane the instances have to lie.

Optimization Problem 2 Given labeled data $((x_1, y_1), \ldots, (x_m, y_m))$ and
unlabeled data $(x_1^*, \ldots, x_k^*)$; over all $w, b, (y_1^*, \ldots, y_k^*)$, minimize $|w|^2$, subject to
the constraints $\forall_{i=1}^{m} \; y_i(w x_i + b) \geq 1$ and $\forall_{i=1}^{k} \; y_i^*(w x_i^* + b) \geq 1$.

The TSVM algorithm which solves optimization problem 2 is related to the 
EM algorithm. TSVM starts by learning parameters from the labeled data and 
labels the unlabeled data using these parameters. It iterates a training step 
(corresponding to the “M” step of EM) and switches the labels of the unlabeled 
data such that optimization criterion 2 is maximized (resembling the “E” step).
The TSVM algorithm is described in [9]. 
Table 1. Co-training algorithm.

Given positive examples $(x_1, x_2, +)$, negative examples $(x_1, x_2, -)$ and unlabeled
examples in two different views $V_1$ and $V_2$; number of iterations $k$.

1. Loop for $k$ iterations

(a) Train $f_1$ and $f_2$ using the labeled positive and negative examples.

(b) Let $f_1$ and $f_2$ select the positive and negative example for which they make
the most confident prediction. Remove the examples from the unlabeled data
and add them to the labeled data.

2. Return the combined classifier $f(x) = f_1(x_1) + f_2(x_2)$.



4.2 Multi-view Learning 

Blum and Mitchell [3] have proposed the multi-view approach to utilizing unlabeled
data. In multi-view learning, the available attributes $V$ are split into two
subsets $V_1$ and $V_2$ such that $V_1 \cup V_2 = V$ and $V_1 \cap V_2 = \emptyset$. A labeled example
$(x, a)$ is then viewed as $(x_1, x_2, a)$ where $x_1$ contains the values of the attributes
in $V_1$ and $x_2$ the values of attributes in $V_2$.

The co-training algorithm is the most prominent multi-view algorithm. The
idea of co-training is to learn two classifiers $f_1(x_1)$ and $f_2(x_2)$ which bootstrap
each other by providing each other with labels for the unlabeled data. Co-training
is applicable when either attribute set suffices to learn the target $f$, i.e., there
are classifiers $f_1$ and $f_2$ such that for all $x$: $f_1(x_1) = f_2(x_2) = f(x)$ (the
compatibility assumption). When the views are furthermore independent given the
class labels, $P(x_1 \mid f(x), x_2) = P(x_1 \mid f(x))$, then co-training converts unlabeled
examples into randomly drawn labeled examples [3].

As $V_1$, we use a randomly drawn 50% of the words occurring in the training
corpus; $V_2$ contains the remaining words. $f_1(x_1)$ and $f_2(x_2)$ are trained from the
labeled examples. Now $f_1$ selects two examples from the unlabeled data that it
most confidently rates positive and negative, respectively, and adds them to the
labeled examples. If the representations in the two views are truly independent,
then the new examples are randomly drawn positive and negative examples for
$f_2$. Now $f_2$ selects two unlabeled examples, the two hypotheses are retrained,
and the process recurs. The algorithm is presented in Table 1.
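The co-training loop can be sketched as follows. To keep the example self-contained, a simple centroid-difference scorer stands in for the support vector machine; this substitution, the sparse dictionary representation and all function names are assumptions made for illustration. Following the prose description, the classifier of a view is retrained right before that view selects its two examples.

```python
import random


def dot(u, v):
    """Dot product of two sparse {term: weight} vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def train_centroid(pos, neg):
    """Stand-in for the SVM: score(x) = x . (mean(pos) - mean(neg)) - offset."""
    def mean(vectors):
        m = {}
        for v in vectors:
            for t, w in v.items():
                m[t] = m.get(t, 0.0) + w / len(vectors)
        return m
    cp, cn = mean(pos), mean(neg)
    direction = {t: cp.get(t, 0.0) - cn.get(t, 0.0) for t in set(cp) | set(cn)}
    offset = (dot(direction, cp) + dot(direction, cn)) / 2.0
    return lambda x: dot(direction, x) - offset


def restrict(x, view):
    """Keep only the attributes of x that belong to the given view."""
    return {t: w for t, w in x.items() if t in view}


def co_train(pos, neg, unlabeled, vocabulary, k=5, seed=0):
    """Co-training with a random 50/50 split of the vocabulary into two views."""
    words = sorted(vocabulary)
    random.Random(seed).shuffle(words)
    views = [set(words[: len(words) // 2]), set(words[len(words) // 2:])]
    pos, neg, unlabeled = list(pos), list(neg), list(unlabeled)
    for _ in range(k):
        for view in views:
            if len(unlabeled) < 2:
                break
            f = train_centroid([restrict(x, view) for x in pos],
                               [restrict(x, view) for x in neg])
            ranked = sorted(unlabeled, key=lambda x: f(restrict(x, view)))
            most_neg, most_pos = ranked[0], ranked[-1]  # most confident picks
            neg.append(most_neg)
            pos.append(most_pos)
            unlabeled = [x for x in unlabeled
                         if x is not most_neg and x is not most_pos]
    f1, f2 = (train_centroid([restrict(x, v) for x in pos],
                             [restrict(x, v) for x in neg]) for v in views)
    # Combined classifier: sum of the two view-specific decision functions.
    return lambda x: f1(restrict(x, views[0])) + f2(restrict(x, views[1]))
```

Each selected example is added with the label predicted by one view and then serves as training data for both views, which is how the two classifiers provide each other with labels.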

The compatibility and independence assumptions are usually violated in
practice. However, empirical studies [14,11] show that co-training can nevertheless
improve performance. Text classification problems in particular seem to
be well suited for co-training [15]. In our experiments, we use co-training
in association with the support vector machine.

5 Case Study 

The data used in this study was provided by the TELES European Internet 
Academy, an education provider that offers classes held via the internet. In 
order to evaluate the predictors, we manually labeled all inbound emails within
a certain period with the matching answer. Table 2 provides an overview of the 
data statistics. Roughly 72% of all emails received can be answered by one of 
nine standard answers. The most frequent question “product inquiries” (requests 
for the information brochure) already covers 42% of all inbound emails. 



Table 2. Statistics of the TEIA email data set. 



Frequently answered question    emails    percentage
Product inquiries                  224           42%
Server down                         56           10%
Send access data                    22            4%
Degrees offered                     21            4%
Free trial period                   15            3%
Government stipends                 13            2%
Homework late                       13            2%
TELES product inquiries              7            1%
Scholarships                         7            1%
Individual questions               150           28%
Total                              528          100%



We briefly summarize the basic principles of ROC analysis which we used to 
assess the decision functions [5,17]. The receiver operating characteristic (ROC) 
curve of a decision function plots the number of true positives against the number 
of false positives. By comparing the decision function against a decreasing
threshold value we observe a trajectory of classifiers described by the ROC curve.

The area under the ROC curve is equal to the probability that, when we 
draw one positive and one negative example at random, the decision function 
assigns a higher value to the positive example than to the negative. Hence, the 
area under the ROC curve (the AUC performance) is a very natural measure of
the ability of a decision function to separate positive from negative examples. 
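The probabilistic reading of the AUC translates directly into code. The sketch below counts, over all pairs of one positive and one negative example, how often the decision function ranks the positive example higher; counting ties as one half is a common convention and an assumption here, and the function name is illustrative.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs that are ranked correctly.

    pos_scores and neg_scores hold the decision function values of the
    positive and negative examples; ties contribute 1/2.
    """
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))
```

The measure depends only on the ordering of the decision values, which is why it can be applied to uncalibrated decision functions.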

In order to estimate the AUC performance and its standard deviation for 
a decision function, we performed between 7 and 20-fold stratified cross vali- 
dation and averaged the AUC values measured on the held out data. In order 
to plot the actual ROC curves, we also performed 10-fold cross validation. In 
each fold, we filed the decision function values of the held out examples into one 
global histogram for positives and one histogram for negatives. After 10 folds, 
we calculated the ROC curves from the resulting two histograms. 
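Computing ROC points from pooled held-out decision values can be sketched as below. Instead of histograms, this illustration works on the raw score lists and sweeps a decreasing threshold over all observed values; the function name is hypothetical.

```python
def roc_points(pos_scores, neg_scores):
    """Return (false positives, true positives) pairs for a decreasing threshold.

    pos_scores and neg_scores are the decision values of all held-out
    positive and negative examples, pooled over the cross validation folds.
    """
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0, 0)]  # threshold above every score: nothing is accepted
    for th in thresholds:
        tp = sum(1 for s in pos_scores if s >= th)
        fp = sum(1 for s in neg_scores if s >= th)
        points.append((fp, tp))
    return points
```

Dividing the two coordinates by the numbers of negative and positive examples, respectively, gives the usual rate-based ROC curve.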

First, we studied the performance of a decision function provided by the
Naive Bayes algorithm (which is used, for instance, in the commercial Autonomy
Answer system) as well as the support vector machine [9]. We
use the default parameter settings for the SVM. Figure 1 shows that the SVM
impressively outperforms Naive Bayes in all cases except for one (TELES product
inquiries). Remarkably, the SVM is able to identify even very specialized
questions with as few as seven positive examples with between 80 and 95%
AUC performance. It has earlier been observed that the probability estimates
of Naive Bayes approach zero and one, respectively, as the length of the analyzed
document increases [2]. This implies that Naive Bayes performs poorly when not
all documents are equally long, as is the case here.



[Figure 1: ROC curve panels, including “Product inquiries”, “TELES product inquiries” and “Government stipends”; one surviving legend reads AUC (SVM) = .91 ± .008, AUC (NB) = .71 ± .022.]

Fig. 1. ROC curves of naive Bayes and the support vector machine for the nine most
frequently asked questions.



In the next set of experiments, we observed how the transductive support 
vector machine improves performance by utilizing the available unlabeled data. 
We successively reduce the amount of labeled data and use the remaining data 
(with stripped class labels) as unlabeled and hold-out data (we use the same set- 
ting for the co-training experiments described in the following). We average five 
re-sampled iterations with distinct labeled training sets. We compare SVM per- 
formance (only the labeled data is used by the SVM) to the performance of the 
transductive SVM (using both labeled and unlabeled data). Table 3 shows the 
results for category “general product inquiries”; Table 4 for “server breakdown”.

When the labeled sample is of size at least 24 + 33 for “general product
inquiries”, or 10 + 30 for “server breakdown”, then SVM and transductive SVM
perform equally well. When the labeled sample is smaller, then the transductive
SVM outperforms the regular SVM significantly. We can conclude that transduction
is beneficial and improves recognition significantly if (and only if) only few
labeled data are available. Note that the SVM with only 8 + 11 examples (“general
product inquiries”) or 5 + 15 examples (“server breakdown”), respectively,
still outperforms the naive Bayes algorithm with all available data.

Table 3. SVM and transductive SVM, “general product inquiries”.

Labeled data         SVM (AUC)          TSVM (AUC)
24 pos + 33 neg      0.87 ± 0.0072      0.876 ± 0.007
16 pos + 22 neg      0.855 ± 0.007      0.879 ± 0.007
8 pos + 11 neg       0.795 ± 0.0087     0.876 ± 0.0068

Table 4. SVM and transductive SVM, “server breakdown”.

Labeled data         SVM (AUC)          TSVM (AUC)
10 pos + 30 neg      0.889 ± 0.0088     0.878 ± 0.0088
5 pos + 15 neg       0.792 ± 0.01       0.859 ± 0.009

Fig. 2. Change of AUC performance with increasing numbers of co-training iterations.

Finally, we want to study how the performance changes when we use co-training
in association with the support vector machine. Figure 2 shows the
AUC performance against the number of co-training iterations. The results
resemble those obtained with the TSVM: co-training improves performance only
when at most 16 positive examples are available for product inquiries and when
at most 5 positive examples are available for server breakdown. The benefit of
both co-training and transduction is greatest when only few labeled data are
available. Transduction outperforms co-training for product inquiries; transduction
and co-training perform similarly for server breakdown.



6 The Responsio Email Management System 

We integrated the learning algorithms into an email assistance system. The key
design principle is that, once the standard answers are entered, it does not require
any extra effort from the user. The system observes incoming emails and replies
sent, but does not require explicit feedback. Responsio is an add-on to Microsoft
Outlook. The control elements (Figure 3) are loaded as a COM object.

[Figure 3: screenshot of the Responsio add-in in Microsoft Outlook (German user interface), showing an inbound request for information material and the reply window with the proposed answer.]

Fig. 3. When an email is read, the most likely answer is displayed in a special field in
the Outlook window. On clicking the “Auto-Answer” button, a reply window with the
proposed answer text is created.

When an email is selected, the COM add-in sends the email body to a second 
process which identifies the language of the email, executes the language specific 
classifiers and determines the posterior probabilities of the configured answers. 
The classifier process notifies the COM add-in of the most likely answer which 
is displayed in the field marked. When the user clicks the “auto answer” button 
(circled in Figure 3), Responsio extracts first and last name of the sender, iden- 
tifies the gender by comparing the first names against a list, and formulates a 
salutation line followed by the proposed standard answer. The system opens a 
reply window with the proposed answer filled in. 

Whenever an email is sent, Responsio identifies whether the outbound mail 
is a reply to an inbound mail by matching recipient and subject line to sender 
and subject line of all emails that are visible in one of the Outlook windows. 
When the sent email includes one of the standard answers, the inbound mail 
is filed into the list of example mails for that answer. These examples can be 
viewed in the Responsio manager window. It is also possible to manually drag 
and drop emails into the example folders. Whenever an example list changes, 
the training unit starts a process with the learning algorithm. 



7 Discussion and Related Results 

We have discussed the problem of identifying instances of frequently asked questions
in emails, using only stored inbound and outbound emails as training data.
Our empirical data shows that identifying a relatively small set of standard questions
automatically is feasible; we obtained AUC performance of between 80 and
95% using as few as seven labeled positive examples. The transductive support
vector machine and the co-training algorithm utilize the available unlabeled data
and improve the recognition rate considerably if and only if only few labeled training
examples are available. The drawback of both semi-supervised algorithms is the
increase in computation time from a few seconds to several minutes. For use in a
desktop application, efficiency is a crucial factor.

Michael Kockelkorn, Andreas Lüneburg, and Tobias Scheffer

A limitation of the available data sources is that we cannot determine examples
of emails for which no $A_i$ is appropriate (we cannot decide whether two
syntactically different answers are really semantically different). Therefore, we
can neither estimate P(no standard answer) nor $P(\bar{A}_i)$ for any $A_i$.

Information retrieval offers a wide spectrum of techniques to measure the 
similarity between a question and questions in an FAQ list. While this approach 
is followed in many FAQ systems, it does not take all the information into account 
that is available in the particular domain of email answering: emails received in 
the past. An FAQ list contains only one single instance of each question whereas 
we typically have many instances of each question available that we can utilize
to recognize further instances of these questions more accurately. 

The domain of question answering [20] is rather loosely related to our email 
answering problem. In our application domain, a large fraction of incoming ques- 
tions can be answered by very few answers. These answers can be pre-configured; 
the difficulty lies in recognizing instances of these frequently asked questions ro- 
bustly, even in very ungrammatical emails. Question answering systems solve a 
problem that is in a way more difficult: selecting an answer sentence from a large 
corpus (such as an encyclopedia) for arbitrary questions. 

Several email assistance systems have been presented. [8,4,19,7] use text clas- 
sifiers in order to predict the correct folder for an email. In contrast to these 
studies, we study the feasibility of identifying instances of particular questions 
rather than general subject categories. 

Related is the problem of filtering spam email. Keyword based approaches,
Naive Bayes [1,16,18,12] and rule-based approaches [6] have been compared.
Generating positive and negative examples for spam requires additional user
interaction: the user might delete interesting emails just like spam after reading
them. By contrast, our approach generates examples for the email answering task
without imposing additional effort on the user. 
Abstract. There are many practical applications where learning from
single class examples is either the only possible solution or has a distinct
performance advantage. The first case occurs when obtaining examples 
of a second class is difficult, e.g., classifying sites of “interest” based 
on web accesses. The second situation is exemplified by the one-class 
support vector machine which was the winning submission of the second 
task of the KDD Cup 2002. 

This paper explores the limits of supervised learning using both positive 
and negative examples. To this end, we analyse the KDD Cup dataset 
using four classifiers (support vector machines and ridge regression) and 
several feature selection methods. Our analysis shows that there is a 
consistent pattern of performance differences between one and two-class 
learning for all algorithms investigated, and these patterns persist even 
with aggressive dimensionality reduction through automated feature selection.
Using insight gained from the above analysis, we generate synthetic
data showing a similar pattern of performance.

1 Introduction 

A standard approach for two class discrimination is to use examples from both 
classes to generate a model for discriminating them. This approach is so en- 
trenched in machine learning that practitioners often will not consider data 
unless it contains examples of both classes. Moreover, many machine learning 
algorithms, such as decision trees, naive Bayes or multilayer perceptron, do not 
function unless the training data includes examples from two classes. However, 
there are many applications where obtaining examples of a second class is diffi- 
cult, e.g., classifying sites of “interest” to a web surfer where the sole information 
that is available are the positive examples or sites that are of interest to the user. 
In such a case, learning from examples of one class is the only possible solution. 

In addition, there are situations when the data has heavily unbalanced rep- 
resentatives of the two classes of interest, e.g., fraud detection and information 
filtering. A supervised algorithm applied to such a problem has to implement 
some form of balancing. In some situations, it may be beneficial to design re- 
balancing even more radically than warranted by unequal proportions, and ig- 
nore the large pool of negative examples and learn from positive examples only. 
A real life learning problem that has benefited from such an approach is the second
task of the KDD Cup 2002 [4], where the winning submission learnt using
just the positive examples which consisted of < 3% of the training data [8].
This paper explores the limits of two-class learning and analyses situations
when this discrimination learning may break down. This exploration begins with 
an analysis of the KDD Cup dataset using four different classifiers: support vec- 
tor machines and ridge regression, in several different settings. We then study 
the performance of these classifiers in the one-class and two-class mode when 
the input feature space is significantly reduced using automatic feature selection
methods. The consistently better performance of the one-class models in
the above analysis leads us to a systematic study of conditions when one-class
learning is advantageous. This study using synthetic data and the four classifiers 
used in the earlier experiments shows that data with a certain combination of 
properties, e.g., the presence of label noise, sparsity of features and low propor- 
tion of minority class, lends itself to better performance with one-class learners. 

The paper is organised as follows. Section 2 places our research in context of 
existing research. Section 3 introduces the basic support vector machines in the 
particular form used for this research, and describes the performance measure 
suitable for our task. We then present results of our experiments with the KDD 
2002 Cup data in Section 4.1 and that for synthetic data in Section 4.2. Finally, 
we discuss the implications of our results in Section 5. 

2 Related Research 

The problem of discrimination of unbalanced classes is encountered in a large 
number of real life situations, e.g., detection of oil spills in satellite radar im- 
ages [9], information retrieval and filtering [10] and biological domains [4,8]. 
Many solutions have been proposed to address the imbalance problem including 
sampling and weighting examples (cf. [7] for a thorough survey). However, they 
typically focus on cases when the imbalance ratio of minority to majority class is 
around 10:90. In this paper, we focus on extreme imbalance, where the minority 
class consists of around 1-3% of the data, and extend the sampling to situations 
when one of the classes is ignored completely and learning is accomplished using 
examples from a single class. 

The possibility of single class learning with support vector machines (SVM)
has been noticed previously. In particular, Schölkopf et al. [14] have suggested
a method of adapting the SVM methodology to one-class learning by treating 
the origin as the only member of the second class. This methodology has been 
used for image retrieval [3] and for document classification [11]. In both cases, 
modelling is performed using examples from the positive class only, and the 
one-class models perform reasonably, although much worse than the two-class 
models learnt using examples from both classes. In contrast, in this paper, we 
show that for certain problems one-class models can perform better. 

3 Classifiers and Performance Metrics 

In this section we recall basic concepts of kernel machines in a form suitable
for this paper. We are given a training sequence $(x_i, y_i)$ of binary $n$-vectors
$x_i \in \{0, 1\}^n \subset \mathbb{R}^n$ and bipolar labels $y_i \in \{\pm 1\}$ for
$i = 1, \ldots, m$. The case of prime interest here is when the target class,
labelled $+1$, is much smaller than the background class (labelled $-1$),
consisting of a minute fraction, $\approx 1$-$3\%$, of the data. Our aim is to find
a “good” discriminating function $f : \{0, 1\}^n \to \mathbb{R}$ that scores the
target class instances higher than the background class instances. The
solution will be given in the form of a kernel machine

$$f(x) = f_\beta(x) + b := \sum_{i=1}^{m} \beta_i k(x, x_i) + b \qquad (1)$$

where $k : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is a kernel function of one of the
forms specified below and $\beta_i, b \in \mathbb{R}$ are parameters to be defined for
the given training set as the minimiser of the regularised risk of the
following form:

$$(\beta, b) \mapsto \|f_\beta, b\|^2 + \sum_{i=1}^{m} C_{y_i}\, \phi\bigl(1 - y_i (f_\beta(x_i) + b)\bigr), \qquad (2)$$

where $C_{+1}, C_{-1} > 0$ are class dependent regularisation constants,
$\phi : \mathbb{R} \to \mathbb{R}_+$ is a convex loss function penalising deviations of scores
from allocated labels and $\|\cdot\|$ is a norm as specified below. Now we specify
variations of the regularised risk (2) leading to four different cases of
kernel machines used in this paper.

1. SVM$^1$: For the popular support vector machine with linear penalty we use
the norm $\|f_\beta, b\|^2 := \|f_\beta\|_k^2$, where

$$\|f_\beta\|_k^2 := \sum_{i,j=1}^{m} \beta_i \beta_j k(x_i, x_j),$$

and the “hinge loss” $\phi(\theta) := \max(0, \theta)$, $\theta \in \mathbb{R}$ [5,15,16];

2. hSVM$^1$: Replacing the norm in the above definition by

$$\|f_\beta, b\|^2 := \|f_\beta\|_k^2 + b^2 \qquad (3)$$

we obtain the homogeneous support vector machine with linear penalty;

3. hSVM$^2$: For the (homogeneous) support vector machine with quadratic
penalty [5] we use norm (3) and the squared hinge loss
$\phi(\theta) := (\max(0, \theta))^2$ for $\theta \in \mathbb{R}$;

4. hRN$^2$: For the regularisation network [6,17] or ridge regression (cf. [5,6,17])
we use norm (3) and the ordinary square loss $\phi(\theta) := \theta^2$ for $\theta \in \mathbb{R}$.
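
The four machines differ only in the loss and in whether the bias enters the norm. The following is a minimal sketch of how the regularised risk (2) could be evaluated, assuming the reading of (2) as norm plus class-weighted loss of $1 - y_i(f_\beta(x_i) + b)$ and a precomputed kernel matrix `K`. The function names are illustrative and do not come from the paper.

```python
import numpy as np

def hinge(t):
    """Linear penalty (SVM^1, hSVM^1): phi(t) = max(0, t)."""
    return np.maximum(0.0, t)

def squared_hinge(t):
    """Quadratic penalty (hSVM^2): phi(t) = max(0, t)^2."""
    return np.maximum(0.0, t) ** 2

def square(t):
    """Ordinary square loss (hRN^2): phi(t) = t^2."""
    return t ** 2

def regularised_risk(beta, b, K, y, C_pos, C_neg, loss, homogeneous):
    """Evaluate risk (2) for f(x) = sum_i beta_i k(x, x_i) + b.

    K is the m x m kernel matrix on the training set (an assumption of
    this sketch); `homogeneous` switches between the norm of SVM^1 and
    norm (3), which adds b^2.
    """
    beta = np.asarray(beta, dtype=float)
    y = np.asarray(y, dtype=float)
    norm_sq = beta @ K @ beta
    if homogeneous:
        norm_sq += b ** 2
    scores = K @ beta + b
    C = np.where(y > 0, C_pos, C_neg)  # class dependent constants
    return norm_sq + np.sum(C * loss(1.0 - y * scores))
```

For example, `regularised_risk(beta, b, K, y, C_pos, C_neg, squared_hinge, True)` would correspond to the hSVM$^2$ objective under these assumptions.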

If the kernel $k$ satisfies the Mercer theorem assumptions [5,15,16] then for
the minimiser of (2) we have $\beta_i = y_i \alpha_i$ where $\alpha_i \geq 0$ for $i = 1, \ldots, m$.

In our investigations we shall be using the popular polynomial kernel

$$k(x, x') = (x \cdot x')^d = \Bigl(\sum_{i=1}^{n} \xi_i \xi'_i\Bigr)^d$$

for $x = (\xi_i)$, $x' = (\xi'_i)$ from $\{0, 1\}^n$ and degree $d = 1, 2, 3$ and $4$.

Note that hSVM$^1$, hSVM$^2$ and hRN$^2$ implement classifiers that correspond
to separation of the data $(z_i, y_i) := (\Phi(x_i), 1, y_i) \in \mathbb{R}^N \times \mathbb{R} \times \{\pm 1\}$
by a hyperplane in the extended feature space passing through the origin
$(0, 0) \in \mathbb{R}^N \times \mathbb{R}$, where $\Phi : \mathbb{R}^n \to \mathbb{R}^N$
is a feature mapping of the observation space $\mathbb{R}^n$ into an appropriate Euclidean
space (the feature space). In particular such a solution is provided also if
all data points belong to a single class, i.e. if $y_i = \mathrm{const}$.

The geometrical meaning of the solution of (2) can be most clearly illustrated
in the limiting case of “hard margin”, i.e. $C \to \infty$. In such a case, the optimal
solution of (2) corresponds to the direction of the shortest vector to the convex
hull spanned by all vectors $y_i z_i \in \mathbb{R}^N \times \mathbb{R}$, $i = 1, \ldots, m$.
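
As a small illustration of the kernel and of the extended feature space, the sketch below computes the polynomial kernel on binary vectors and appends the constant coordinate 1 to each feature vector. It is written for the linear case ($d = 1$), where the feature mapping is the identity. The helper names `poly_kernel` and `extend` are hypothetical.

```python
import numpy as np

def poly_kernel(X, Z, d=1):
    """Homogeneous polynomial kernel k(x, x') = (x . x')^d for rows of X, Z."""
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    return (X @ Z.T) ** d

def extend(Phi):
    """Map each feature vector Phi(x) to z = (Phi(x), 1)."""
    Phi = np.asarray(Phi, dtype=float)
    return np.hstack([Phi, np.ones((Phi.shape[0], 1))])
```

For binary inputs the dot product simply counts shared non-zero attributes, and inner products of the extended vectors equal $k(x, x') + 1$, which is one way to see that a hyperplane through the origin of the extended space carries its own bias term.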

3.1 Re-balancing of the Data 

Two ways of compensating for the imbalance in the training data will be investigated
in this paper.

Hard balancing. We use all $m_+$ positive instances but randomly choose only
$m_-$ instances with the negative labels (the majority class), varying the “mixture”
ratio $R_{-/+} = m_-/m_+$. We set the regularisation constants to $C_{-1} = C_{+1} =
C/m_+$, where $C > 0$ is a chosen constant. In this form of balancing, $R_{-/+} = 0$
is the case of positive 1-class learning, and $R_{-/+} = 1$ represents the case of
balanced 2-class learning when the same number of examples from both classes
is used.

Soft balancing. We use all available training data but with different class
regularisation constants: $C_{-1} = (1 - B)C/2m_-$ and $C_{+1} = (1 + B)C/2m_+$,
where $C > 0$ and $-1 \leq B \leq +1$ is a balance parameter. Here $B = +1$ and
$B = -1$ correspond to 1-class learning, and $B = 0$ is 2-class learning with both
classes “balanced” according to their prior proportions.

The advantage of hard balancing over soft balancing is the speed
of generation of a solution, as it typically uses a smaller training set. For this
reason hard balancing was used in most of our experiments.
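
The two re-balancing schemes can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the negative sample is drawn without replacement, and the `max(1, ...)` guards against an empty class are an addition of this sketch.

```python
import numpy as np

def hard_balance(y, ratio, C, rng):
    """Keep all positives plus about ratio * m_plus random negatives.

    Returns the selected indices and the constants (C_{+1}, C_{-1}),
    both equal to C / m_plus as in the hard balancing scheme.
    """
    y = np.asarray(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    m_neg = min(len(neg), int(round(ratio * len(pos))))
    if m_neg > 0:
        chosen = rng.choice(neg, size=m_neg, replace=False)
    else:
        chosen = neg[:0]  # ratio 0: positive 1-class learning
    keep = np.concatenate([pos, chosen])
    c = C / max(1, len(pos))
    return keep, c, c

def soft_balance(y, B, C):
    """Return (C_{+1}, C_{-1}) for balance parameter -1 <= B <= +1."""
    y = np.asarray(y)
    m_pos = max(1, int(np.sum(y == 1)))
    m_neg = max(1, int(np.sum(y == -1)))
    return (1 + B) * C / (2 * m_pos), (1 - B) * C / (2 * m_neg)
```

With `B = 1` the negative class receives a zero constant, so its examples no longer contribute to the loss, which matches the description of 1-class learning above.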



3.2 Centroids 



Now we introduce the fifth and the simplest of the five algorithms considered 
here in terms of generation of the solution. In contrast to SVMs, it is inherently 
non-sparse and so can be complex to implement in the case of non-linear kernels. 

Algorithm 5, Cntr$_B$. For the centroid classifier we set

$$f(x) = f_{\mathrm{Cntr}_B}(x) := \frac{(1 + B) \sum_{i,\, y_i = +1} k(x_i, x)}{2 \max(1, m_+)} - \frac{(1 - B) \sum_{i,\, y_i = -1} k(x_i, x)}{2 \max(1, m_-)}$$

where $x \in \mathbb{R}^n$ and $-1 \leq B \leq +1$ is the balance factor.

In terms of the feature space, the centroid classifier implements the projection
on the direction of a weighted difference between centroids of data from both
class labels. For $B = -1$ it is the direction of the majority class, for $B = +1$
that of the minority class and for $B = 0$ that of the difference between centroids
of both classes.
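
A minimal sketch of the centroid scores, assuming a precomputed kernel matrix between test and training points and the difference-of-centroids form described above. The function name and argument layout are hypothetical.

```python
import numpy as np

def centroid_scores(K_test_train, y_train, B=0.0):
    """Centroid classifier scores for each test point.

    K_test_train[t, i] holds k(x_i, x_t) for training point i and test
    point t (layout assumed for this sketch); B is the balance factor.
    """
    K = np.asarray(K_test_train, dtype=float)
    y = np.asarray(y_train)
    pos, neg = (y == 1), (y == -1)
    m_pos = max(1, int(pos.sum()))
    m_neg = max(1, int(neg.sum()))
    return ((1 + B) * K[:, pos].sum(axis=1) / (2 * m_pos)
            - (1 - B) * K[:, neg].sum(axis=1) / (2 * m_neg))
```

No optimisation is involved: each score is a weighted average of kernel values, which is why every training point contributes and the solution is non-sparse.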
Now we formally link centroids and SVMs. Later we shall connect this result
to some of our empirical findings. The formal proof, omitted here, can be easily
derived from the Karush-Kuhn-Tucker conditions for the SVM solution.



Theorem 1 If functions $k(x_i, \cdot)$, $i = 1, \ldots, m$ are linearly independent, then

$$\lim_{C \to 0^+} \frac{\ldots}{C} = \lim_{C \to 0^+} \frac{\ldots}{C} = \ldots \qquad (4)$$

for $p = 1, 2$, where $f_\beta$ denotes the homogeneous part of the soft balanced
SVM solution (1) for the appropriate machine. Moreover, if both classes are
represented in training, i.e. for the soft balance factor $-1 < B < +1$, then:

$$\lim_{C \to 0^+} \frac{\ldots}{C} = \ldots \qquad (5)$$


3.3 Performance Measures 

We have used AROC, the Area under the Receiver Operating Characteristic 
(ROC) curve as our main performance measure. In that we follow the steps of 
KDD 2002 Cup, but also, we see it as the natural metric of general goodness of 
classifier (as corroborated below) capable of meaningful results even if the target 
class is a tiny fraction of the data. 

We recall that the ROC curve is a plot of the true positive rate (or recall),
$P(f(x_i) > \theta \mid y_i = 1)$, against the false positive rate, $P(f(x_i) > \theta \mid y_i = -1)$, as
a decision threshold $\theta$ is varied. The concept of ROC curve originates in the
military signal detection but these days it is widely used in many other ar- 
eas, including data mining, psychophysics and medical diagnosis (cf. review [2]). 
In the latter case, AROC is viewed as a measure of general “goodness” of a
test, formalised as a predictive model $f$ in our context, with a clear statistical
meaning as follows. According to Bamber’s interpretation [1], $AROC(f)$ is
equal to the probability of correctly answering the two-alternative-forced-choice
problem: given two cases, one $x_i$ from the negative and the other $x_j$ from the
positive class, allocate scores in the right order, i.e. $f(x_i) < f(x_j)$. An additional
attraction of AROC as a figure of merit is its direct link to the well researched
area of order statistics via $U$-statistics and the Wilcoxon-Mann-Whitney test [1].

There are some ambiguities in the case of AROC estimated from a discrete 
set in the case of ties, i.e. when multiple instances from different classes receive 
the same score. Following [1] we implement in this paper the definition 

$$AROC(f) = P(f(x_i) < f(x_j) \mid -y_i = y_j = 1) + 0.5\, P(f(x_i) = f(x_j) \mid -y_i = y_j = 1)$$

expressing AROC in terms of conditional probabilities, which can be re-formulated
in terms of the rank-ordered test sequence (where the rank is imposed by the
scores allocated by $f$).

Note that the trivial uniform random predictor has AROC of 0.5. 
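
The tie-aware definition above can be estimated directly by comparing every negative-positive pair. The sketch below is one straightforward way to do so; it uses memory proportional to the number of such pairs, so a rank-based formulation would be preferable for large test sets. The function name `aroc` is illustrative.

```python
import numpy as np

def aroc(scores, y):
    """Pairwise AROC estimate with ties counted as one half.

    scores: classifier outputs f(x); y: labels in {+1, -1}.
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y)
    pos = scores[y == 1]
    neg = scores[y == -1]
    diff = pos[:, None] - neg[None, :]  # f(x_j) - f(x_i) for all pairs
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
```

A predictor that assigns the same score to every instance has only ties and therefore obtains exactly 0.5 under this definition.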
4 Experiments

In order to understand the boundaries when the performance of two-class classifiers
deteriorates, we have explored the following datasets: (1) Real life data in
the form of the Aryl Hydrocarbon Receptor signalling pathway data provided 
for the second task of the 2002 KDD cup (henceforth referred to as the AHR 
data) (Section 4.1), and (2) Synthetic data created with some specific properties 
such as presence of noise in labels (Section 4.2). 



4.1 Analysis of KDD Cup 2002 Data 

In our main experiments we have used the AHR data set, which is the combined
training and test data sets used for task 2 of KDD Cup 2002. The data set is 
based on experiments by Guang Yao and Chris Bradfield of McArdle Laboratory 
for Cancer Research, University of Wisconsin. These experiments aimed at iden- 
tification of yeast genes that, when knocked out, cause a significant change in 
the level of activity of the Aryl Hydrocarbon Receptor signalling pathway (cf. [4] 
for more details). In this paper we follow the setting of the “broad task” of the 
KDD Cup: the discrimination between 127 ‘positive’ genes from the combined 
class encompassing the labels “change” and “control” and the remaining 4380 
genes forming the ‘negative’ class. We note that the results for the first subtask, 
namely, learning “change” class are similar [8]. In our experiments this set has 
been repeatedly split into 70% for training and 30% for testing. All averages and 
standard deviations reported are for independent tests on 20 such random splits. 

Each of the 4507 instances in the data set is described by a variety of information
that characterises the gene associated with the instance, e.g., associated
abstracts from scientific articles, genes whose encoded proteins physically interact 
with one another, information about the subcellular localisation and functional 
classes of the proteins encoded by various genes. For the experiments described 
in this paper, we convert all of the information from the different files to a sparse 
matrix containing 18330 binary features as described in [8]. 



Impact of Regularisation Constant. Figure 1 shows mean AROC as a function
of $C$ for four different linear kernel machines ($d = 1$) with the hard balancing
(Figures A-D) and the soft balancing (Figures E-H). We use four different modes
as follows: (i) positive 1-class ($R_{-/+} = 0$ and $B = +1$, solid line); (ii) negative
1-class ($B = -1$, dotted line); (iii) balanced 2-class ($R_{-/+} = 1$ and $B = 0$,
dashed line); (iv) un-balanced 2-class ($R_{-/+} = 35 \approx 4380/127$, when all examples
from both classes are used, the dash-dot line). The standard deviations
are shown as vertical bars.

An inspection of plots brings a number of interesting observations: 

1. The un-balanced 2-class machines (dash-dot lines) and negative 1-class
machines (dotted lines) have inferior performance relative to either positive 1-class
machines or the balanced 2-class SVMs for most values of $C$ (excepting very low
$C$). Thus only the latter two modes will be used in further research in this paper.
Fig. 1. Mean AROC for AHR-data as a function of the regularisation constant $C$ for
1-class and 2-class SVMs with the hard balancing (figures A-D) and the soft balancing
(figures E-H) for four linear ($d = 1$) SVMs and four different modes: (i) positive
1-class ($R_{-/+} = 0$ and $B = +1$, solid line); (ii) negative 1-class ($B = -1$, dotted line);
(iii) balanced 2-class ($R_{-/+} = 1$ and $B = 0$, dashed line); (iv) un-balanced 2-class
($R_{-/+} = 35$, the dash-dot line). The standard deviations are shown as vertical bars.



2. All positive 1-class and balanced 2-class machines show a very good and
roughly equal performance for very low values of $C$. Additionally, these values
of mean AROC are equal to that for the positive 1-class centroid ($B = +1$,
AROC = 62.4 ± 3.4) and the balanced 2-class centroid Cntr$_0$ ($B = 0$, AROC =
61.5 ± 3.9) trained on the whole data. In the case of soft balancing this can be
inferred from Theorem 1 since it implies that the orders imposed on test data
by the scores of the respective centroid and SVM classifiers coincide, hence yield
the same AROC (uniquely determined by such an order).

3. There are noticeable differences between the performance of different
SVMs. For instance, note the differences between unbalanced 2-class hSVM$^1$
and SVM$^1$ (dash-dot lines in Figures 1A and 1B, respectively).

4. Positive 1-class hSVM$^2$ is very robust across the whole range of $C$ values
(cf. the solid line in Figure 1C). In particular, for high values of $C$, i.e., virtually
the hard margin case [8], it performs better than any other SVM tested. This
setting was used for the winning submission to KDD Cup 2002.

In summary, the top performance by SVMs is achieved at extremely high
values of $C$, i.e. the hard margin case, or at the limit of very low $C$. For low $C$ the
best SVMs are equivalent to the respective centroid machines, for positive one-class
($B = 1$) or the balanced 2-class ($B = 0$). For high $C$, the positive one-class
consistently outperforms other settings. This motivates our restrictions on
experimental settings for the rest of the paper as follows. We shall concentrate
exclusively on hard balanced SVMs trained with high $C$ and centroid classifiers
with $B$ set to 0 (Cntr$_0$) for a range of hard balance mixture ratios $R_{-/+}$.
Fig. 2. Mean AROC with ± standard deviation envelopes for AHR-data as a function
of the hard balanced data with mixture ratio $R_{-/+} \in \{0, 0.01, 0.1, 0.5, 1, 5, 10, 35\}$ for
four different fractions of the original feature set (0.1%, 1%, 10% and 50% out of 18,330
features). The three linear ($d = 1$) kernel machines were trained with $C = 1000$, while
the centroid classifiers were developed with $B = 0$.



Impact of Feature Selection. We have used several different feature selection 
strategies including the document frequency thresholding [13], [18], mutual 

information [18], information gain [12], inverse document frequency - term fre- 
quency [13] and average discrimination scoring [13]. 

The results obtained have very similar trends, and due to space limitations 
we present in Figure 2 plots for only two selection methods: and mutual infor- 

mation. The main thing to observe is the trend of the dropping performance by 
SVMs as the negative class examples are added. For hSVM^ this is visible even 
with 18 features selected by method. (This is also the case for all the other 
methods we have tested except mutual information.) The poor performance of 
mutual information at low fractions of features is exceptional among the techniques
we have tested. It can be explained by the strong influence of the
marginal probability of features on this criterion, which tends to favour “rare”
features rather than common ones.

The performance of SVM$^1$ is generally poorer than that of hSVM$^1$ or
hSVM$^2$ at the same settings. Further, there are peaks and valleys at different
mixture ratios that warrant further investigation into the behaviour of SVM$^1$.

The performance of the centroid classifiers Cntr$_0$ is quite different. There is
a pronounced dip with high variance when very few negative examples are used
($R_{-/+} < 0.05$, which represents 1-6 negative examples only). However, as more
negative examples are included, the performance improves and catches up with
that of the positive 1-class classifier. Further, this pattern is consistent even at
an extremely low number of features.
4.2 Experiments with Synthetic Data

We observed in Section 4.1 that even in low dimensional space, the phenomenon 
of better performance with one-class learner persists. Our intuitive explanation 
here is that if the learner uses the minority class examples only, the “corner” 
(the half space) where minority data resides is properly determined. However, the 
minority class is “swamped” by the background class, hence once the background 
instances are added, the SVM solution becomes suboptimal. Now we explore this 
intuition using synthetic data. 

We use three data sets of instances of similar structure. The observation
vectors in these synthetic data sets contain a small number $n_{inf}$ of informative
attributes and the remaining, larger number, $n_{noise}$, of noise attributes. These
attributes are binary, generated according to uniform random distribution with
probabilities $p_{inf}$ and $p_{noise}$ of value $= +1$, respectively. The informative
attributes determine the labels modulo the additional label noise which is the
random reversal of certain proportions of labels, namely the proportions $LN_+$ of
the positive and $LN_-$ of the negative labels. In all sets, we generate $m = 9000$
instances of which $p_{y=+1} = 3\%$ have labels $y = +1$.

— $S_1$: For this data set we use $n = n_{inf} + n_{noise} = 1 + 999$ dimensions and
$p_{noise} = 2\%$. The labels are generated as a random bipolar label vector
$y \in \{\pm 1\}^{9000}$ with the proportion $p_{y=+1} = 3\%$ of positive examples. For the
informative dimension we set $x_{inf} = (y + 1)/2 \in \{0, 1\}$ and then change
randomly the proportion $LN_- = 20\%$ of 0s to 1s.

— $S_2$: In this case $n_{inf} = 10$, $n_{noise} = 990$, $p_{inf} = 5\%$, $p_{noise} = 2\%$. Having
defined informative attributes $x_{inf,i} \in \{0, 1\}^{10}$ for $i = 1, \ldots, 9000$, we have
randomly generated a vector $v \in \mathbb{R}^{10}$, then chosen a bias $b \in \mathbb{R}$ such that
for 2004 ($\approx 22\%$) instances $i$ we got the scores $x_{inf,i} \cdot v > b$. Of these 2004
instances, we randomly selected 270 instances (= 3% of 9000) and labelled them
$+1$, and the remaining 8730 instances we labelled $-1$.

— $S_3$: This set was designed to test the impact of non-linear kernels. It is
generated as $S_1$ with the difference that only $n = n_{inf} + n_{noise} = 1 + 19 = 20$
dimensions are used and the random proportions $LN_+$ and $LN_-$ of both the
$+1$ and the 0 entries, respectively, are reversed in the second phase of the
generation of the informative attribute $x_{inf}$.

In experiments, each set of 9000 instances generated as described above, was 
split randomly into 3000 training and 6000 test instances, with proportional sam- 
pling (without replacement) from both classes. All results reported are averages 
of 20 such random splits. 
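
To make the construction of the first synthetic set concrete, the following is a minimal sketch of a generator following the verbal description of $S_1$. It is not the authors' generator: the function name `make_s1`, the seed handling, and the use of exact rather than random counts for the 3% positives and the 20% flipped entries are assumptions of this sketch.

```python
import numpy as np

def make_s1(m=9000, n_noise=999, p_noise=0.02, p_pos=0.03, ln_neg=0.20, seed=0):
    """Generate an S1-like data set: one informative plus n_noise noise bits."""
    rng = np.random.default_rng(seed)
    # Bipolar labels with an exact proportion p_pos of +1 (assumption).
    m_pos = int(round(p_pos * m))
    y = -np.ones(m, dtype=int)
    y[rng.choice(m, size=m_pos, replace=False)] = 1
    # Informative attribute x_inf = (y + 1) / 2, then label noise LN_-:
    # a proportion ln_neg of its 0 entries is switched to 1.
    x_inf = (y + 1) // 2
    zeros = np.flatnonzero(x_inf == 0)
    n_flip = int(round(ln_neg * len(zeros)))
    x_inf[rng.choice(zeros, size=n_flip, replace=False)] = 1
    # Independent noise attributes, each equal to 1 with probability p_noise.
    noise = (rng.random((m, n_noise)) < p_noise).astype(np.int8)
    X = np.column_stack([x_inf.astype(np.int8), noise])
    return X, y
```

Under these settings roughly one in eight instances with a non-zero informative attribute is actually positive, which illustrates how the minority class can be "swamped" by the background class.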

Figure 3 presents the results of experiments evaluating AROC as a function
of the mixture ratio $R_{-/+}$ for the four kernel machines. For all three data sets,
we show the results for the linear kernel (Figures 3A-3C), and for $S_3$ we show
the impact of higher degree polynomial kernels (Figures 3D-3F).

The results, especially for hSVM^, strikingly resemble those obtained for the 
AHR data (cf. Figure 2), with the consistent pattern of decreasing performance
with increasing proportion of negative class instances. Note the ‘collapse’ of
the SVM$^1$ algorithm when the negative class proportion is very low, and the
significantly better performance of hSVM$^1$ than SVM$^1$ for data dominated by the
positive class ($R_{-/+} \approx 0$) and the reverse of it for data dominated by the
negative class ($R_{-/+} \gg 1$).

Fig. 3. Mean AROC with ± standard deviation envelopes as a function of the mixture
ratio $R_{-/+}$ for four machines with $C = 1000$: SVM$^1$, hSVM$^1$, hSVM$^2$ and hRN$^2$.
Plots A, B, C show results for linear kernels with $S_1$, $S_2$ and $S_3$, respectively. Results
with higher degree polynomial kernels ($d = 2, 3, 4$) are shown for $S_3$ in plots D, E and
F, respectively. $R_{-/+} \in \{0, 0.01, 0.1, 0.5, 1, 5, 10, 35\}$.

As kernel degree increases we observe the familiar pattern of decreasing performance
with increasing dominance of negative class instances (Figures 3C-3F).
Thus, the relatively low dimensional $S_3$ data set, when used with higher degree
polynomial kernels, behaves in a way similar to that of the high dimensional
datasets $S_1$ and $S_2$ with linear kernels.

We also experimented with different values of the regularisation constant C, 
but found that this had marginal impact on AROC in the above settings. 

In addition, our experiments with different label noise settings ($LN_+$ and
$LN_-$) show that the pattern of decreasing performance with increasing amounts
of negative class instances persists with different levels of label noise.

5 Discussion 

It is interesting that in our extensive experiments, while positive one-class
classifiers using $\approx 3\%$ of the AHR-data provide the best models, learning with negative
class examples only provides very poor models (e.g., mean AROC = 62.4 ± 3.4%
vs. 38.3 ± 3.3% for centroid models and hSVM$^1$, hSVM$^2$ and hRN$^2$ classifiers
at the low $C$ limit). Further, the unbalanced 2-class SVMs perform consistently
below the level of the trivial random classifier (AROC < 50%). The reasons for
this behaviour need to be investigated more thoroughly.

Our experiments indicate that the performance of a one-class learning method
depends on the learning machine, though the dominant tendency of performance
deterioration with increased influence of the majority (the negative) class was
visible for our range of classifiers. In particular, the popular support vector
machine with linear penalty (SVM$^1$) performed poorly for minority class dominated
learning while performing much better for the majority class learning setting.
The minor modification of this algorithm (hSVM$^1$) has demonstrated quite
opposite properties, while the support vector machine with quadratic penalty
(hSVM$^2$) performed relatively well over the whole range of settings. This
appraisal of the positive one-class hSVM$^2$ holds true, in particular, for its
performance on AHR-data, where learning is implemented with 85 minority class
instances in the full 18330 dimensional feature space.

The degradation of learning performance in the presence of abundant neg- 
ative examples has been noted in [9]. Their solution of focusing on the best 
positive region works in low dimensional input space when there is a single re- 
gion to be labelled as positive and the minority class is around 10% of the data. 
For our situation of very high dimensional input space with around 3% minority 
class data, the more drastic solution of totally ignoring negative class examples 
seems to work better for all machines. 

Further, even when the sparse high dimensional space is reduced to a more
dense representation via aggressive feature reduction methods, the advantage of
one-class learners persists. This indicates that there is a combination of factors
involved in this phenomenon, and a more thorough investigation with synthetic 
data is warranted. 

Experiments with the polynomial kernels seem to indicate that interactions
between the 19 noisy attributes in the set $S_3$ are equivalent to explicit addition
of hundreds of extra noise attributes in the datasets $S_1$ and $S_2$. The higher the
degree of the kernel, the more such ‘noisy’ virtual attributes are added (on the
level of the feature space) and the more pronounced is the difference between
one-class and two-class learning. Note that in this case, in contrast to the case
of AHR-data, the range of AROC values is around 60-90%.

6 Conclusion 

We have shown that learning from positive examples only can be advantageous
for real life data such as AHR-data used in KDD Cup 2002 in terms of classifier
accuracy, but is not restricted to this data set. A few synthetic data sets tested
in this paper show that favourable conditions for such learning method can 
naturally arise in many other situations, in particular when popular support 
vector machines with non-linear kernels are used. More research is required to 
study these conditions. 
Our experiments demonstrate that one-class learning from positive class ex-
amples can be a very robust classification technique when dealing with very 
unbalanced data and high dimensional noisy feature space. It can be used as 
an alternative to aggressive feature selection usually used in such situations and 
can be very attractive for learning with non-linear kernels, when direct feature 
selection on the feature space level cannot be implemented. 

Abstract. The naming of natural features, such as hills, lakes, springs,
meadows etc., provides a wealth of linguistic information; the study of 
the names and naming systems is called onomastics. We consider a data 
set containing all names and locations of about 58,000 lakes in Fin- 
land. Using computational techniques, we address two major onomastic 
themes. First, we address the existence of local dependencies or repul- 
sion between occurrences of names. For this, we derive a simple form 
of spatial association rules. The results partially validate and partially 
contradict results obtained by traditional onomastic techniques. Second, 
we consider the existence of relatively homogeneous spatial regions with 
respect to the distributions of place names. Using mixture modeling, we 
conduct a global analysis of the data set. The clusterings of regions are 
spatially connected, and correspond quite well with the results obtained 
by other techniques; there are, however, interesting differences with pre- 
vious hypotheses. 



1 Introduction 

In spatial statistics, a point process is a random process that produces points in 
the Euclidean plane. A realization of such a process, i.e., a set of points, is called 
a point pattern, or spatial point data [1,2]. A marked point process consists of 
several point processes producing different types of points. The points are often 
also called events. 

Marked point processes arise in many applications, such as linguistics (in the 
study of dialects or place names, each word, grammatical construct, name, etc. 
corresponds to a different type of event) , biodiversity studies (different types of 
events correspond to, e.g., different types of plants, and the locations are the 
places in which the plant has been observed), business applications (locations 
of customers etc.). There are some fundamental differences in the point data 
in these applications. The most relevant here is that in some of these cases the
point data represents an underlying phenomenon that is continuous (e.g. the
occurrence area of a species, or the area in which a particular word is used), while 
in others the underlying phenomenon is itself discrete. In the current study we 
discuss the latter type of point processes. 

The analysis of high-dimensional point processes can be quite demanding. 
The data is often sparse, i.e., we have only fragmentary information of the un- 
derlying phenomenon. When there are several different types of events, mod- 
eling their interaction can be complex. In many cases the observed quantities 
are results of several unobserved processes. The granularity and accuracy of the 
locations of the points can vary: sometimes the event can be localized perfectly, 
sometimes not. 

Spatial statistics (see, e.g., the books [1,2]) has developed several strong 
methods for analyzing a single point process. However, marked point processes 
with a high number of different types of events have received less attention. 

This paper is a case study in the use of pattern discovery and mixture modeling
for the analysis of high-dimensional marked point processes.

Our application is in the area of linguistics, especially onomastics (the study 
of names), particularly place names. The naming of natural features, such as hills, 
lakes, springs, meadows etc., provides a wealth of information. Our example data 
consists of full information about place names in Finland. The names tend to 
be fairly old, and they provide information about the population history and 
linguistic conditions at the time when the names were given.

Research in onomastics has traditionally been conducted by selecting a single 
name, or a group of related names, drawing maps of their occurrences, and doing
qualitative analysis of the patterns of occurrences. Global analyses of the spatial 
distributions of different names are non-existent. 

Our case study concerns two major themes in onomastics. The first is depen- 
dence between occurrences of names. It has long been assumed that the name of 
a nearby location has an influence on the naming of a location. For example, if 
a lake is called “Black Lake” (usually because the water is sufficiently clear that 
one can see the dark bottom of the lake), then a nearby lake might be named 
“White Lake” . No quantitative evidence for this phenomenon is known, however. 
A special case of the local influence of names is repulsion: if a location is called 
B, then it makes sense to assume that other similar locations near this will not 
be called B: after all, the purpose of naming is to assign identifiers to locations. 
Our first goal is to study the local interactions between names. 

The second theme we want to verify is the existence of relatively homo- 
geneous spatial regions with respect to the distribution of place names. It is 
typically assumed that the naming conventions in nearby areas should be more 
or less similar, i.e., that there are clear regional trends in the style of names. 
The occurrence maps of individual names support this hypothesis, but virtually 
no global analyses exist. 

In this paper we address both these themes. We first show how one can 
modify the basic ideas of association rule techniques to obtain local descriptions 
of the dependencies between the occurrences of names. The results show that
indeed there are statistically significant associations between the occurrences of 
names. As for repulsion effects, we show that they are far less noticeable than 
expected. 

For the second theme, we demonstrate the use of mixture modeling for the 
data at the granularity of municipalities, and show that the resulting clusters of 
municipalities are spatially extremely coherent. Thus the results verify the basic 
hypothesis that spatial homogeneity exists and provide new data for further 
onomastic research into the naming processes that cause the phenomenon. 

The rest of this paper is organized as follows. The data set is described in 
Section 2. In Section 3 we show how the basic ideas of association rules can be 
generalized to the case of spatial point patterns, and give a sample of the results. 
Section 4 describes how mixture modeling applies to this data set, and discusses 
the results briefly. Section 5 is a brief conclusion. 

2 The Data Set 

Our example data set is a subset of the Finnish names occurring in the National 
Place Name Register, a part of the Geographic Names Register kept by the 
National Land Survey of Finland. The register contains all place names that 
appear on the 1:20 000 Basic Map and is maintained for the purposes of creating 
these maps. The size of the register, as well as that of our subsets, can be found
in Table 1, which shows the total number of Finnish names (or name instances),
the number of different names, and the number of different municipalities in
which these names are found.



Table 1. National Place Name Register data

                      Name         Different    Municipalities
                      instances    names
Entire Register       717 747      303 626      447
Lakes                  58 267       25 178      408
Common lake names       9 008           54      315
Name endings           55 538           45      407

The full data model of the register is explained in [3], but for the present
study it is sufficient to note that the register includes a language field, a
feature type field and the spatial information in different formats, including
two co-ordinate systems and several administrative divisions. The feature type
categorizes geographical features into such classes as lake or pond, river or
stretch of river, or forest. For lakes, the location is fixed to be a selected
point inside the lake.

For our study we first selected all lake names in Finnish. We then pruned this
selection along two different lines. For our primary data set we chose the
names that have at least 90 instances. While our aim was to concentrate on
the most common names, the limit of 90 instances is somewhat arbitrary. To
supplement the primary data set we selected for clustering purposes a second
data set, consisting not of complete place names but of derivational suffixes and
final parts of compound names.

The two different subsets were selected mainly for onomastic reasons. Our
working hypothesis was that spatial associations are in large part related to the
phenomenon of contrastive names, that is, pairs of names that refer to similar
geographical features and differ only by the first part of the name in some sort
of contrastive manner. To study this we needed to search for spatial associations
for full names. Similarly, both intuition and onomastic consensus would say that
there is a repulsion effect between two instances of the same name which is
closely related to the use of place names to identify a place: a name cannot
normally be used by the same group of people to denote two different places of
the same type¹. Again, this means we have to study the full names. In either
case it seems appropriate to restrict ourselves to relatively common names, to
make sure there are enough instances of each of them to get valid results.

With clustering the situation is somewhat different. The obvious way to start 
is to use full names, like we do with the association rules, and there is no reason to 
doubt that this approach works. However, it is also reasonable to postulate that 
by studying word endings — both derivative suffixes and end-parts of compound 
names — we can get insight into differences in naming practices. Using name 
endings is thus an attempt to do cluster analysis based on the distribution of 
various name types, not just names as such. 

3 Spatial Association Rules 

In this section we consider the first theme: finding local effects between the
occurrences of different names. As an example, consider Figures 1 to 3 showing
the occurrences of certain pairs of names. How do the occurrences of one name
affect the probability of occurrence of another name? It is fairly clear that the
maps alone cannot answer the question.

In spatial statistics questions such as this have been addressed by using, e.g., 
nearest neighbor distances or the K function and its derivatives [4,2]. Here we 
describe a similar approach, but using the terminology of association rules. 

Given a set of observations over 0-1 attributes $A_1, \dots, A_n$, an association
rule is an expression $X \Rightarrow Y$, where $X, Y \subseteq \{A_1, \dots, A_n\}$.
Given a set $X$ of attributes, the frequency $f(X)$ of $X$ is the fraction of
observations that have a 1 in all attributes of $X$. The frequency of the rule is
defined to be $f(X \cup Y)$, and the accuracy (confidence) of the rule is
$f(X \cup Y)/f(X)$.
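
The two quantities just defined can be illustrated with a minimal Python sketch. The function names, the dictionary-based row format and the toy observations are assumptions made for illustration only; they do not come from the paper.

```python
def frequency(rows, attrs):
    """Fraction of observations that have a 1 in every attribute of attrs."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if all(row.get(a, 0) == 1 for a in attrs))
    return hits / len(rows)


def confidence(rows, x, y):
    """Accuracy (confidence) of the rule X => Y, i.e. f(X u Y) / f(X)."""
    fx = frequency(rows, x)
    if fx == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return frequency(rows, set(x) | set(y)) / fx


# Hypothetical toy observations over three 0-1 attributes.
rows = [
    {"A1": 1, "A2": 1, "A3": 0},
    {"A1": 1, "A2": 0, "A3": 0},
    {"A1": 1, "A2": 1, "A3": 1},
    {"A1": 0, "A2": 1, "A3": 0},
]
rule_frequency = frequency(rows, {"A1", "A2"})       # frequency of A1 => A2
rule_confidence = confidence(rows, {"A1"}, {"A2"})   # confidence of A1 => A2
```

In this toy data the rule A1 => A2 holds in two of the four observations, and A1 alone holds in three, so the confidence is two thirds.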

We consider spatial association rules of the form $A \Rightarrow_r B$. The
interpretation of such a rule is that given a location $(x, y)$ in which an event
of type $A$ occurs, one is likely to see at least one event of type $B$ within
distance $r$ from $(x, y)$. This definition is close to the ones used by
[5,6,7,8,9]. From an onomastic point of view it seems prudent to start by
restricting ourselves to associations between two names.

¹ It is, however, relatively common to name e.g. a farm after a nearby lake.

To test the significance of a rule $A \Rightarrow_r B$ we start with a set of
places named $A$ and another set of places named $B$. We want to evaluate whether
the occurrence of a $B$ is more likely in the context of a nearby $A$ than in
general. Note, however, that a $B$ can only occur if there is a suitable natural
feature present: we cannot observe a “Pike Lake” at position $(x, y)$ unless
there is a lake at $(x, y)$. To take this into account we consider as a set of
reference points all points belonging to the same type of feature as $B$; call
this set $C_B$. In our case we used all Finnish lakes as $C_B$.

The probability that a given place that belongs to $C_B$ is named $B$ is
$P(B) = N(B)/N(C_B)$, where $N(B)$ is the total number of places named $B$ and
$N(C_B)$ is the total number of all the places of the same type. We now select
the places belonging to set $C_B$ which are within the given radius $r$ of a
place named $A$. We denote the size of this selection by $n(C_B)$ and the number
of $B$ places in it by $n(B)$. As null hypothesis we can now assume that the
occurrences of $A$ and $B$ are independent. Under this hypothesis our selection
can be viewed as a random sample, in which the number of $B$ places can be
approximated by the Poisson distribution, $X \sim \mathrm{Poisson}(\lambda)$,
where $\lambda = n(C_B) \, N(B)/N(C_B)$. To correct for multiple testing, we use
the Bonferroni correction.
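
The test described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: points are plain coordinate tuples, the selection includes any reference point within distance r of some A (including an A lake itself if it is in the reference set), the reported p-value is the upper tail P(X >= n(B)), and the Bonferroni correction simply multiplies it by a caller-supplied number of tests. The figures in the paper print probabilities in a different convention.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam); 0.0 when k < 0.

    Direct summation; fine for moderate lam (exp(-lam) underflows near 745).
    """
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def rule_test(a_points, b_points, ref_points, r, n_tests=1):
    """Poisson test for the spatial rule A =>_r B.

    a_points   -- coordinates of places named A
    b_points   -- coordinates of places named B (a subset of ref_points)
    ref_points -- coordinates of all places of the same feature type (C_B)
    n_tests    -- number of tests for the Bonferroni correction (assumed input)
    """
    b_set = set(b_points)
    selected = [c for c in ref_points
                if any(math.dist(c, a) <= r for a in a_points)]
    n_cb = len(selected)                           # n(C_B)
    n_b = sum(1 for c in selected if c in b_set)   # n(B)
    lam = n_cb * len(b_set) / len(ref_points)      # n(C_B) * N(B) / N(C_B)
    p_upper = 1.0 - poisson_cdf(n_b - 1, lam)      # P(X >= n(B))
    p_corrected = min(1.0, p_upper * n_tests)      # Bonferroni
    return n_cb, n_b, lam, p_upper, p_corrected


# Hypothetical toy coordinates (kilometres).
a_points = [(0, 0)]
ref_points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (9, 9)]
b_points = [(1, 0), (0, 1)]
n_cb, n_b, lam, p_upper, p_corrected = rule_test(a_points, b_points, ref_points, r=2)
```

With real data the pairwise distance scan would normally be replaced by a spatial index; the quadratic loop is kept here only for brevity.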



Repulsion. Repulsion is essentially a special case of a spatial association rule
$A \Rightarrow_r B$, where $A = B$. However, in this situation we select points
based on the spatial distribution of $A$; it is not immediately obvious that this
can be considered a random sample with regard to $A$. We have therefore used
another method to confirm the results on repulsion.

In the general case we again start with two kinds of points, $A$ and $B$, the
latter of which belong to set $C_B$. The overall numbers of points $B$ and $C_B$
are $N(B)$ and $N(C_B)$, respectively; the probability of a given $C_B$ point
being a $B$ point is $p = N(B)/N(C_B)$.

Within a given radius of the $i$th point with name $A$ there are $n(C_{B_i})$
points of set $C_B$. We use random variable $X_i$ to denote the number of points
named $B$ in this set. If the $B$ points are distributed independently of each
other, $X_i \sim \mathrm{Bin}(n(C_{B_i}), p)$, so $E(X_i) = n(C_{B_i})\,p$ and
the variance is $D^2(X_i) = n(C_{B_i})\,p\,(1-p)$. Summing, we obtain a variable
$S_m = \sum_{i=1}^{m} X_i$; by assuming independence of the variables $X_i$, we
have $E(S_m) = \sum_{i=1}^{m} E(X_i)$ and $D^2(S_m) = \sum_{i=1}^{m} D^2(X_i)$.
Applying the central limit theorem we can obtain confidence estimates.
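
As a rough sketch of how the sums above lead to a confidence estimate, the following Python fragment computes the mean and variance of the summed count and a normal-approximation tail probability. The function name, its inputs and the toy numbers are hypothetical; the paper does not state how the estimates were computed beyond invoking the central limit theorem.

```python
import math


def summed_binomial_z(neighbourhood_sizes, observed_total, p):
    """Normal approximation for S_m = sum of X_i, X_i ~ Bin(n_i, p).

    neighbourhood_sizes -- the values n(C_Bi), one per point named A
    observed_total      -- observed number of B points summed over neighbourhoods
    p                   -- probability that a reference point is named B
    """
    mean = sum(n * p for n in neighbourhood_sizes)
    var = sum(n * p * (1 - p) for n in neighbourhood_sizes)
    if var == 0:
        raise ValueError("variance is zero; the approximation is undefined")
    z = (observed_total - mean) / math.sqrt(var)
    # Lower-tail probability of the standard normal distribution.
    lower_tail = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return mean, var, z, lower_tail


# Hypothetical toy input: four neighbourhoods and p = 0.1.
mean, var, z, lower_tail = summed_binomial_z([10, 20, 30, 40], 4, 0.1)
```

A small lower-tail probability would indicate fewer B points near A points than independence predicts, which is how repulsion would show up.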



Results. Applying the method presented above to the common names data set 
gave both expected and unexpected results. As expected, most of the pairs of 
names had no significant associations either way. Also to be expected was that 
there were pairs that had significant repulsion between the names: the spatial 
distributions of these names just don’t overlap, for various reasons related to 
such things as geography or variation in dialects.

One interesting sub-category of the association rules was what can be called
contrasting names. These have traditionally been considered only for such pairs
as Mustalampi “Black Lake” and Valkealampi “White Lake”, where the contrasting
element in at least one of the names refers to a notable property of the lake and
there is a clear antonymic relation between the two names. Our study indicates
that this kind of variation is used in the naming process more widely and with
far less strict semantic constraints for the elements than onomasticians have
thought. For instance, there was a group of three names, Ahvenlampi “Perch
Lake”, Haukilampi “Pike Lake” and Särkilampi “Roach Lake”, all of which had
significant associations with each other even over small distances. Figure 1 shows
the spatial distribution for Ahvenlampi and Haukilampi on a map with main
dialectal regions, along with Poisson-approximated probabilities before and after
the Bonferroni correction.







Ahvenlampi => Haukilampi:
+ At 1 km found 20; p(n<20) = 1.0000 (corrected 1.00)
+ At 2 km found 40; p(n<40) = 1.0000 (corrected 1.00)
+ At 3 km found 51; p(n<51) = 1.0000 (corrected 0.99)
+ At 4 km found 75; p(n<75) = 1.0000 (corrected 1.00)
+ At 5 km found 92; p(n<92) = 1.0000 (corrected 0.97)
+ At 6 km found 116; p(n<116) = 1.0000 (corrected 0.98)
+ At 7 km found 137; p(n<137) = 1.0000 (corrected 0.95)
+ At 8 km found 170; p(n<170) = 1.0000 (corrected 1.00)
+ At 9 km found 181; p(n<181) = 1.0000 (corrected 0.96)
+ At 10 km found 204; p(n<204) = 1.0000 (corrected 0.98)

Haukilampi => Ahvenlampi:
+ At 1 km found 20; p(n<20) = 1.0000 (corrected 1.00)
+ At 2 km found 40; p(n<40) = 1.0000 (corrected 1.00)
  At 3 km found 50; p(n<50) = 1.0000 (corrected 0.91)
+ At 4 km found 75; p(n<75) = 1.0000 (corrected 0.99)
  At 5 km found 92; p(n<92) = 1.0000 (corrected 0.88)
  At 6 km found 113; p(n<113) = 0.9999 (corrected 0.73)
  At 7 km found 131; p(n<131) = 0.9996 (corrected 0.00)
  At 8 km found 154; p(n<154) = 0.9998 (corrected 0.53)
  At 9 km found 175; p(n<175) = 0.9999 (corrected 0.64)
  At 10 km found 195; p(n<195) = 0.9999 (corrected 0.80)

Fig. 1. Spatial distribution of Haukilampi (x) and Ahvenlampi (+)

There were, however, other pairs that would at first glance appear to be 
similarly contrasting, but whose associations are somewhat weaker and start 
to show at significantly longer distances. In fact, the question arises whether 
there is a connection in the naming process or whether the names just have a 
similar distribution. One such case is the pair of Joutenlampi “Swan Lake” and 
Hanhilampi “Goose Lake”, as shown in Figure 2. The reasons for the difference 
between this pair and that of Ahvenlampi — Haukilampi are not very obvious, 
and further onomastic study of these phenomena is needed. 

Then there are pairs of names that have a significant association but are not
contrasting, like Lehmilampi “Cow Lake” and Likolampi “Retting Lake”², as
shown in Figure 3. In some cases another reason for the association can be seen;
here, for instance, both names have similar agricultural origins. Although one
can make such guesses about the reasons for the association, the phenomenon
itself is a new discovery, and again further study would be strongly indicated.

² The name refers to a step in the processing of flax into linen.

Hanhilampi => Joutenlampi:
  At 1 km found 0; p(n<0) = 0.0000 (corrected 0.00)
  At 2 km found 3; p(n<3) = 0.9259 (corrected 0.00)
  At 3 km found 3; p(n<3) = 0.6418 (corrected 0.00)
  At 4 km found 5; p(n<5) = 0.6983 (corrected 0.00)
  At 5 km found 9; p(n<9) = 0.8927 (corrected 0.00)
  At 6 km found 18; p(n<18) = 0.9990 (corrected 0.00)
  At 7 km found 21; p(n<21) = 0.9985 (corrected 0.00)
+ At 8 km found 31; p(n<31) = 1.0000 (corrected 0.98)
  At 9 km found 33; p(n<33) = 1.0000 (corrected 0.91)
  At 10 km found 37; p(n<37) = 1.0000 (corrected 0.91)

Joutenlampi => Hanhilampi:
  At 1 km found 0; p(n<0) = 0.0000 (corrected 0.00)
  At 2 km found 3; p(n<3) = 0.8542 (corrected 0.00)
  At 3 km found 3; p(n<3) = 0.4347 (corrected 0.00)
  At 4 km found 5; p(n<5) = 0.4496 (corrected 0.00)
  At 5 km found 9; p(n<9) = 0.6805 (corrected 0.00)
  At 6 km found 20; p(n<20) = 0.9968 (corrected 0.00)
  At 7 km found 25; p(n<25) = 0.9981 (corrected 0.00)
  At 8 km found 33; p(n<33) = 0.9998 (corrected 0.49)
  At 9 km found 35; p(n<35) = 0.9990 (corrected 0.00)
  At 10 km found 40; p(n<40) = 0.9992 (corrected 0.00)

Fig. 2. Spatial distribution of Hanhilampi (x) and Joutenlampi (+)



Lehmilampi => Likolampi:
  At 1 km found 8; p(n<8) = 0.9998 (corrected 0.48)
+ At 2 km found 21; p(n<21) = 1.0000 (corrected 1.00)
+ At 3 km found 34; p(n<34) = 1.0000 (corrected 1.00)
+ At 4 km found 45; p(n<45) = 1.0000 (corrected 1.00)
+ At 5 km found 56; p(n<56) = 1.0000 (corrected 0.99)
+ At 6 km found 69; p(n<69) = 1.0000 (corrected 0.99)
+ At 7 km found 87; p(n<87) = 1.0000 (corrected 1.00)
+ At 8 km found 104; p(n<104) = 1.0000 (corrected 1.00)
+ At 9 km found 125; p(n<125) = 1.0000 (corrected 1.00)
+ At 10 km found 143; p(n<143) = 1.0000 (corrected 1.00)

Likolampi => Lehmilampi:
  At 1 km found 8; p(n<8) = 0.9999 (corrected 0.72)
+ At 2 km found 19; p(n<19) = 1.0000 (corrected 1.00)
+ At 3 km found 30; p(n<30) = 1.0000 (corrected 1.00)
+ At 4 km found 37; p(n<37) = 1.0000 (corrected 0.98)
  At 5 km found 44; p(n<44) = 1.0000 (corrected 0.88)
  At 6 km found 48; p(n<48) = 0.9991 (corrected 0.00)
  At 7 km found 61; p(n<61) = 0.9999 (corrected 0.71)
  At 8 km found 72; p(n<72) = 1.0000 (corrected 0.93)
+ At 9 km found 85; p(n<85) = 1.0000 (corrected 1.00)
+ At 10 km found 93; p(n<93) = 1.0000 (corrected 1.00)

Fig. 3. Spatial distribution of Lehmilampi (x) and Likolampi (+)



The repulsion between different instances of the same name does not seem
to be a very common phenomenon. Onomastically, this is rather surprising. It is
true that our data set contains such names as Pahalampi “Evil Lake”³ (shown
in Figure 4) or Palolampi “Burnt Lake”⁴, where there are no instances within
2 km of each other. However, the area covered by such selections is rather small,
and most of these findings cannot be considered significant. The repulsion effects
are for the most part insignificant even without the Bonferroni correction. One
possible explanation for the scarcity of significant repulsion is that the body of
Finnish lake names is relatively large and the distance a name needs to retain
its usefulness as an identifier quite small: the name of a typical small lake is only
used within a single village. The latter of these two factors may be sufficient to
keep the repulsion small enough to disappear into the random variation caused
by the former.

With all this in mind, it is still somewhat surprising to find that there are
cases like Umpilampi “Closed Lake”⁵ (shown in Figure 5) where there is a visible
association even at distances of 1 km or less. Again, one can guess at the reasons
why this is possible: these are mostly small ponds, and in many cases the need
to refer to one of them exists only within one farming family. Nevertheless,
this would appear to contradict the onomastic consensus that the basic unit for
name use in rural areas is one village.

³ Some of these, possibly even a large number, are euphemisms for a vulgar name
that the locals considered too offensive to tell outsiders they perceived as being of a
higher social standing, such as visiting onomasticians or geographers.

⁴ These names are related to the agricultural method of burn-beating, practiced in
some places in Finland until the early 20th century.

⁵ That is, a small lake overgrown with weeds.

Pahalampi => Pahalampi:
  At 1 km found 0; p(n>0) = 0.7161 (corrected 0.00)
  At 2 km found 0; p(n>0) = 0.9899 (corrected 0.00)
  At 3 km found 8; p(n>8) = 0.6203 (corrected 0.00)
  At 4 km found 16; p(n<16) = 0.5101 (corrected 0.00)
  At 5 km found 41; p(n<41) = 0.9997 (corrected 0.17)
+ At 6 km found 63; p(n<63) = 1.0000 (corrected 1.00)
+ At 7 km found 76; p(n<76) = 1.0000 (corrected 1.00)
+ At 8 km found 87; p(n<87) = 1.0000 (corrected 1.00)
+ At 9 km found 103; p(n<103) = 1.0000 (corrected 1.00)
+ At 10 km found 119; p(n<119) = 1.0000 (corrected 1.00)

Fig. 4. Spatial distribution of Pahalampi

Umpilampi => Umpilampi:
  At 1 km found 9; p(n<9) = 0.9999 (corrected 0.66)
+ At 2 km found 32; p(n<32) = 1.0000 (corrected 1.00)
+ At 3 km found 66; p(n<66) = 1.0000 (corrected 1.00)
+ At 4 km found 82; p(n<82) = 1.0000 (corrected 1.00)
+ At 5 km found 103; p(n<103) = 1.0000 (corrected 1.00)
+ At 6 km found 126; p(n<126) = 1.0000 (corrected 1.00)
+ At 7 km found 136; p(n<136) = 1.0000 (corrected 1.00)
+ At 8 km found 154; p(n<154) = 1.0000 (corrected 1.00)
+ At 9 km found 164; p(n<164) = 1.0000 (corrected 1.00)
+ At 10 km found 171; p(n<171) = 1.0000 (corrected 1.00)

Fig. 5. Spatial distribution of Umpilampi

4 Probabilistic Modeling



We now turn to the second onomastic theme, the existence or nonexistence of 
homogeneous regions with respect to place names. We tested this hypothesis by 
considering the municipalities as observations, and using mixture modeling and 
the EM algorithm to obtain a clustering of the municipalities. 

In more detail, we took the 315 municipalities, and created 54 variables,
one for each of the names in the common names data set. This gives us a
54-dimensional data set, where each column indicates the number of occurrences of
the name in the municipality. We then took the 407 municipalities and 45 name
endings, and conducted a similar test on that set.

We apply mixture modeling to this data set [10,11]. A (finite) mixture model
assigns a probability $P(x \mid \Theta)$ to an observation $x$ as a weighted sum
$\sum_{j=1}^{K} \pi_j P(x \mid \theta_j)$ of component distributions
$P(x \mid \theta_j)$ for $j = 1, \dots, K$, where the weights (or mixing
proportions) $\pi_j$ satisfy $\pi_j \geq 0$ and $\sum_{j=1}^{K} \pi_j = 1$.

For each single component of the model for an observation $x = (x_1, \dots, x_d)$
we assume independence between variables and use the multinomial Bernoulli
distribution

$$P(x \mid \theta) = \prod_{i=1}^{d} \theta_i^{x_i}$$

with the constraint $\sum_{i=1}^{d} \theta_i = 1$. A finite mixture of multivariate
Bernoulli probability distributions is thus specified by the equation

$$P(x \mid \Theta) = \sum_{j=1}^{K} \pi_j P(x \mid \theta_j)
  = \sum_{j=1}^{K} \pi_j \prod_{i=1}^{d} \theta_{ji}^{x_i}$$

with the parameterization $\Theta = \{\pi_1, \dots, \pi_K, \theta_{11}, \dots,
\theta_{Kd}\}$ containing $K(d + 1)$ parameters for data with $d$ dimensions.

Given a data set P with d binary variables and the number K of mixture
components, the parameter values of the mixture model can be estimated using
the Expectation Maximization (EM) algorithm [12,13,14]. The EM algorithm
has two steps which are applied alternately in an iterative fashion. Each step is
guaranteed not to decrease the likelihood of the observed data, and the algorithm
converges to a local maximum of the likelihood function [12,15]. The method
gives for each component and each observation a probability of the observation
stemming from that component.
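
The EM iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming component densities of the product form with parameters that sum to one per component, a small smoothing constant `eps`, and either random or caller-supplied starting parameters; none of these implementation details are given in the paper, and the toy count vectors are invented.

```python
import math
import random


def em_mixture(data, k, n_iter=50, init_theta=None, seed=0, eps=1e-6):
    """EM for a K-component mixture with P(x|theta_j) = prod_i theta_ji ** x_i."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    pi = [1.0 / k] * k
    if init_theta is None:
        theta = []
        for _ in range(k):
            raw = [rng.random() + 0.5 for _ in range(d)]
            total = sum(raw)
            theta.append([v / total for v in raw])
    else:
        theta = [list(row) for row in init_theta]
    resp = [[1.0 / k] * k for _ in range(n)]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        for idx, x in enumerate(data):
            logs = [math.log(pi[j])
                    + sum(xi * math.log(theta[j][i]) for i, xi in enumerate(x))
                    for j in range(k)]
            top = max(logs)
            weights = [math.exp(v - top) for v in logs]
            total = sum(weights)
            resp[idx] = [w / total for w in weights]
        # M-step: re-estimate mixing proportions and component parameters.
        for j in range(k):
            nj = sum(resp[idx][j] for idx in range(n))
            pi[j] = max(nj / n, eps)
            counts = [eps + sum(resp[idx][j] * data[idx][i] for idx in range(n))
                      for i in range(d)]
            total = sum(counts)
            theta[j] = [c / total for c in counts]
        norm = sum(pi)
        pi = [v / norm for v in pi]
    return pi, theta, resp


# Hypothetical count vectors for six "municipalities" and three "names".
data = [[9, 1, 0], [8, 2, 0], [10, 0, 1], [0, 1, 9], [1, 0, 8], [0, 2, 10]]
pi, theta, resp = em_mixture(
    data, k=2, init_theta=[[0.6, 0.2, 0.2], [0.2, 0.2, 0.6]])
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
```

Each row of `resp` holds the per-component probabilities for one observation, so a hard clustering is obtained by taking the most probable component, as in `labels`.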

We applied mixture modeling to the data described above; for each municipality
$x$ and component $j$ we can compute the probability of the observation $x$
stemming from component $j$ by

$$P(j \mid x, \Theta) = \frac{\pi_j P(x \mid \theta_j)}
                              {\sum_{k=1}^{K} \pi_k P(x \mid \theta_k)}.$$

For most municipalities there is clearly one component $j$ which gives the
municipality the highest probability. Example results are shown in Figures 6 and 7.
The different clusters are shown in shades of grey; white municipalities have no
lakes in the data set⁶.

[Figure 6 panels: 2 clusters, 3 clusters, 4 clusters]

Fig. 6. Clustering based on the most common lake names

Several features are of interest. First of all, the clusters of municipalities 
obtained in this way are spatially very well connected. Note that the method 
in itself has no information about the locations of the municipalities, and hence 
the spatial connectedness of the clusters is interesting. Second, as the number of 
clusters increases, the existing cluster boundaries tend not to change very much, 
but rather existing clusters split. Third, the clusters obtained correspond fairly 
well with the previous onomastic information about the distribution of names. 

Specifically, in roughly the southernmost third of the map the boundary seen 
in the two-cluster maps corresponds rather well with the division between the 
eastern and western dialectal groups of Finnish. There is a small but noticeable 
deviation in Tavastland, and this is in line with our knowledge of the history of 
the settlement of Finland. Likewise, the western cluster continues north along 
the coast, and this too is in line with what we know from history. However, 
the middle third looks rather interesting: large regions that were designated and 
used as hunting grounds for the dialectally western Tavastland communities as 
late as the 16th century are not associated with the parent province but instead 
with the eastern regions, from where they were to a large extent populated in 
the 17th century. This would appear to imply that there is far less old influence 
in the names of that region than has been commonly believed, and this in turn 
opens up a variety of interesting onomastic questions. 



⁶ This is mostly because the common names data set contains only 15% of the lakes,
but also because Finland is a bilingual country, and there are some municipalities
that are uniformly Swedish.

Fig. 7. Clustering based on the name ends



5 Conclusions 

We have described a case study in the area of high-dimensional spatial point 
processes. We showed how one can use the basic principles of rule discovery and 
mixture modeling to analyze an onomastic data set about place names. The 
discovered rules of association and repulsion between names show fascinating 
local effects between the occurrences. The global analysis of name distribution 
by using mixture modeling demonstrated that homogeneous onomastic regions 
do exist. The methods lead to novel onomastic results. While the computational 
techniques we used are fairly standard, their application was not trivial. The 
global and local analysis of names has been shown to be very useful, and the 
study is continuing in several directions. 

The existing techniques can be used to answer many onomastic questions.
While computational methods of this type have not previously been applied to
onomastic data, the reactions of various researchers in that field have been
promising. However, there are also computational open problems. Finding more
complex local interactions between names is a particularly interesting one: for
example, if A and B occur close to each other, then C is likely to occur close,
too. While straightforward generalizations of association rules of the type
$AB \Rightarrow C$ are possible, it might be more useful to investigate rules of
the form $\Gamma B \Rightarrow C$, where $\Gamma$ is a derived predicate of
position, e.g., of the type “there are names of type $\alpha$ in the
neighborhood”.

A deeper issue is separating the different layers in the process leading to a
particular name occurring in a particular location. In order for a lake at location
$(x, y)$ to be called “Black Pond”, there has to be a lake at that location, the
people who named it must use the words “black” and “pond” in their dialect, their
naming conventions must allow for the combined name to occur, etc. Thus the
data is actually produced by several interacting phenomena, and finding the
influence of each is not easy.



Abstract. Constraint-based mining of sequential patterns is an active
research area motivated by many application domains. In practice, real
sequence datasets can contain consecutive repetitions of symbols
(e.g., DNA sequences, discretized stock market data). These repetitions
can lead to very high resource consumption during pattern extraction,
to the point of making even efficient algorithms unusable. We propose a
constraint-based mining algorithm that compacts these consecutive
repetitions, drastically reducing the amount of data to process and the
extraction time. The technique introduced in this paper retains the
advantages of existing state-of-the-art algorithms based on the notion of
occurrence lists, while extending their application fields to datasets
containing consecutive repetitions. We analyze the benefits obtained using
synthetic datasets, and show that the approach is of practical interest
on real datasets.

Keywords: constraint-based mining, sequential pattern, generalized oc- 
currence 



1 Introduction 

Sequential pattern mining was introduced in 1995 [1]. It concerns pattern
discovery (e.g., regularities) from ordered data, typically sequence databases.
It has many applications, e.g., customer purchase analysis, Web Usage Mining,
DNA sequence analysis. Looking for efficient algorithms has received a lot of
attention (e.g., [8,11,9,5,10,12,14,13]). Each of these algorithms has its own pros
and cons. Their efficiency depends on the characteristics of the data and on the
kind of user-defined selection criteria, i.e., the constraints that must be satisfied
by the extracted patterns. Several available algorithms are based on so-called
occurrence lists, i.e., lists that contain the locations of the patterns in the
data. This technique has proved very useful for frequent pattern extraction
(e.g., [8,12,14,3,13]).

* This research is partially funded by the European Commission IST Programme -
Future and Emergent Technologies, cInQ project (IST-2000-26469).

Independently, the use of user-defined constraints to reduce the search space 
during sequential pattern extraction has been developed (e.g., [11,9,4,2]). In- 
deed, it has also been integrated in the occurrence list approach in the cSpade 
algorithm [13], resulting in one of the most efficient algorithms proposed for 
constraint-based mining of sequential patterns. 

We have two main application domains for which we need efficient sequential
pattern algorithms: financial data (stock market data) analysis for CDC (a major
financial company in France) and DNA sequence database analysis. When
applying the cSpade approach to these data, we found that the benefits
of occurrence lists are lost when mining sequences containing
consecutive repetitions of symbols. This comes from an explosion of the number of
occurrences due to the repetition of the symbols. We recently proposed a way to
handle repetitions efficiently in the occurrence lists [7] when considering only a
minimal frequency constraint. In this paper, we present how to generalize the
notion of occurrence to perform efficient constraint-based mining on collections
of sequences that contain repetitions. From a practical point of view, this leads
to a technique that retains the advantages of the cSpade approach, while being
able to address efficiently a broader scope of applications. The key idea is to use
a single generalized occurrence to represent several occurrences while keeping
enough information for the mining process.

This paper is organized as follows. Section 2 recalls the constraint-based
sequential pattern mining problem and gives an abstract formulation of an
algorithm for sequential pattern mining using occurrence lists. The notion of
generalized occurrence is introduced in Section 3, and the corresponding
modifications of the mining algorithm are presented. The practical impact of the
use of generalized occurrences is demonstrated by means of experiments in
Section 4. We conclude in Section 5.

2 Problem Statement and Abstract Algorithm 

2.1 Constrained Sequential Pattern 

The problem is to mine all frequent sequential patterns, verifying some
user-defined constraints, that can be found in a sequence database. The constraints
considered in this paper are the so-called minimum and maximum gap constraints,
which specify the minimum or maximum time interval between the occurrences of
two events inside a pattern. Another similar constraint considered is the time
window constraint, which limits the maximum time between the first event and the
last event of a pattern. Basically, the problem can be presented as follows. Let
$I = \{i_1, i_2, \dots, i_m\}$ be a set of $m$ distinct items. An event (also called
itemset) of size $l$ is a non-empty set of $l$ items from $I$:
$(i_1 i_2 \dots i_l)$. A sequence $\alpha$ of length $L$ is an ordered list of $L$
events $\alpha_1, \dots, \alpha_L$, denoted as
$\alpha_1 \to \alpha_2 \to \dots \to \alpha_L$. A database is composed of
sequences, where each sequence has a unique sequence identifier (sid) and each
event of each sequence has a temporal event identifier (eid) called timestamp.
For a sequence in the database, each eid associated to an event is unique and if
an event $e_i$ precedes event $e_j$ in a sequence, then the eid of $e_j$ must be
strictly greater than the eid of $e_i$. A sequential pattern (or pattern) is a
sequence. Due to lack of space, we consider only single-item events in patterns,
that is, patterns composed of events of size 1. The extension to patterns
composed of events of size greater than 1 is straightforward and can be found in
an extended version of the paper [6].

We are interested in the so-called constrained sequential patterns defined
as follows. A sequence $s_a = \alpha_1 \to \alpha_2 \to \dots \to \alpha_n$ is
called a subsequence of another sequence
$s_b = \beta_1 \to \beta_2 \to \dots \to \beta_m$ if and only if there exist
integers $1 \le i_1 < i_2 < \dots < i_n \le m$ such that
$\alpha_1 \subseteq \beta_{i_1}, \alpha_2 \subseteq \beta_{i_2}, \dots,
\alpha_n \subseteq \beta_{i_n}$.
Let supMin be a positive integer called absolute support threshold; a pattern $p$
verifies the minimum frequency constraint in a database $D$ if $p$ is a
subsequence of at least supMin sequences of $D$. In this paper, we also use
interchangeably the relative support threshold, expressed as a percentage of the
number of sequences of $D$. Let gapMin be the fixed value of the minimum gap
constraint. A pattern $p = \alpha_1 \to \alpha_2 \to \dots \to \alpha_n$ verifies
the minimum gap constraint if and only if, for all $\alpha_i$,
$i = 1, \dots, n-1$, $eid(\alpha_{i+1}) - eid(\alpha_i) > gapMin$. Similarly, let
gapMax be the fixed value of the maximum gap constraint. Pattern $p$ verifies the
maximum gap constraint if and only if, for all $i = 1, \dots, n-1$,
$eid(\alpha_{i+1}) - eid(\alpha_i) < gapMax$. Now, let winMax be the fixed value
of the time window constraint. Pattern $p$ verifies this constraint if and only
if $eid(\alpha_n) - eid(\alpha_1) < winMax$.
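
The definitions above can be made concrete with a short Python sketch. The function names and the representation (events as Python sets, an occurrence as the list of its eids) are assumptions for illustration. The comparisons are strict because that is how the inequalities read in the text as printed; if the intended constraints are non-strict, the operators would need to be adjusted accordingly.

```python
def is_subsequence(pattern, sequence):
    """True if every event of pattern is contained, in order, in sequence.

    Both arguments are lists of events; each event is a set of items.
    A greedy left-to-right scan is sufficient for this containment test.
    """
    pos = 0
    for event in pattern:
        while pos < len(sequence) and not set(event) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def satisfies_constraints(eids, gap_min=None, gap_max=None, win_max=None):
    """Check gap and window constraints on one occurrence, given its eids."""
    gaps = [later - earlier for earlier, later in zip(eids, eids[1:])]
    if any(gap <= 0 for gap in gaps):
        raise ValueError("eids must be strictly increasing")
    if gap_min is not None and any(gap <= gap_min for gap in gaps):
        return False  # some gap is not strictly greater than gapMin
    if gap_max is not None and any(gap >= gap_max for gap in gaps):
        return False  # some gap is not strictly smaller than gapMax
    if win_max is not None and eids and eids[-1] - eids[0] >= win_max:
        return False  # the occurrence does not fit in the time window
    return True


# Hypothetical example sequence with three events.
sequence = [{"A"}, {"C"}, {"B", "C"}]
found = is_subsequence([{"A"}, {"B"}], sequence)
ok = satisfies_constraints([1, 3, 4], gap_min=0, gap_max=3, win_max=4)
```

Here the pattern A -> B occurs at eids 1 and 3 of the example sequence, while the occurrence with eids 1, 3, 4 has gaps 2 and 1 and spans 3 time units.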

2.2 Abstract Mining Algorithm 

We present in this section an abstract algorithm corresponding to the general
principle used in algorithms that mine sequential patterns with occurrence
lists (e.g., [8,12,14,3,13]). The algorithm repeats two operations:
the generation of candidate patterns and a support counting step. Let us introduce
some needed concepts. A pattern with k items is called a k-pattern. A prefix of a
k-pattern z is the subpattern of z made of the first k - 1 items of z, and its
suffix is its last item. We extend the notions of prefix and suffix to
occurrences. Let $y = e_1 \to e_2 \to \dots \to e_{k-1} \to e_k$ be an occurrence of a k-pattern
z; then $prefix(y) = e_1 \to e_2 \to \dots \to e_{k-1}$ and $suffix(y) = e_k$.

The algorithm uses two frequent k-patterns $z_1$ and $z_2$ having the same
(k-1)-pattern as prefix to generate a (k+1)-pattern z. This operation is denoted
$merge(z_1, z_2)$ and generates a single (k+1)-pattern: $z = z_1 \to suffix(z_2)$. The
support counting for the newly generated pattern is not made by scanning the whole
database. Instead, the algorithm has stored in specific lists, called occLists, the
positions where $z_1$ and $z_2$ occur in the database. It then uses these two lists,
denoted $occList(z_1)$ and $occList(z_2)$, to determine where z occurs. The support of z
is then computed directly from $occList(z)$, by counting the number of distinct
sids present in this list. The computation of $occList(z)$ is a kind of join and is
denoted $join(z_1, z_2)$. The abstract algorithm is presented as Algorithm 1.
Algorithm 1 (Abstract Mining Algorithm)

Input: a database of sequences and a support threshold.

Output: the frequent sequential patterns contained in the database.

Use the database to compute:
  - F_1, the set of all frequent items
  - occList(z) for every element z of F_1
let i := 1
while F_i ≠ ∅ do
  let F_{i+1} := ∅
  for all z_1 ∈ F_i do
    for all z_2 ∈ F_i do
      if z_1 and z_2 have the same prefix then
        let z := merge(z_1, z_2)
        let occList(z) := join(occList(z_1), occList(z_2))
        Use occList(z) to determine if z is frequent
        if z is frequent then
          F_{i+1} := F_{i+1} ∪ {z}
        fi
      fi
    od
  od
  i := i + 1
od
output ⋃_{1 ≤ j < i} F_j

Fig. 1. Abstract mining algorithm using occurrence lists.
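As an illustration, the abstract algorithm can be instantiated in Python as sketched below. This is only one possible instantiation under simplifying assumptions: occurrence lists are plain sets of (sid, eid) pairs, no gap or window constraint is applied, and the names `mine` and `support` are hypothetical.

```python
from collections import defaultdict


def mine(database, sup_min):
    """database: dict sid -> list of events (sets of items); timestamps start at 1.
    Returns a dict mapping each frequent pattern (tuple of items) to its
    occurrence list, a set of (sid, eid-of-last-item) pairs."""
    # Level 1: occurrence lists of single items.
    occ = defaultdict(set)
    for sid, events in database.items():
        for eid, event in enumerate(events, start=1):
            for item in event:
                occ[(item,)].add((sid, eid))

    def support(pairs):
        # Number of distinct sequences in which the pattern occurs.
        return len({sid for sid, _ in pairs})

    level = {z: o for z, o in occ.items() if support(o) >= sup_min}
    frequent = dict(level)
    while level:
        next_level = {}
        for z1, occ1 in level.items():
            for z2, occ2 in level.items():
                if z1[:-1] != z2[:-1]:       # the prefixes must be identical
                    continue
                z = z1 + (z2[-1],)           # merge(z1, z2)
                # join: keep the ends of z2 that come strictly after the
                # earliest end of z1 in the same sequence.
                first_end = {}
                for sid, eid in occ1:
                    first_end[sid] = min(eid, first_end.get(sid, eid))
                joined = {(sid, eid) for sid, eid in occ2
                          if sid in first_end and eid > first_end[sid]}
                if support(joined) >= sup_min:
                    next_level[z] = joined
        frequent.update(next_level)
        level = next_level
    return frequent
```

The support of a candidate is obtained from its occurrence list alone, without rescanning the database, which is the property the abstract algorithm relies on.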



3 Generalized Occurrences and GoSpec Algorithm 

3.1 Constrained Generalized Occurrences 

The structure of a constrained generalized occurrence list is designed to reduce
the size of the occurrence lists by representing several occurrences with a single
more general one. When the data contain consecutive repetitions of items,
this yields a significant gain in memory space, and since the
lists processed by the join operation are shorter, it also reduces
the overall execution time.

For example, let us consider the following toy database containing three
sequences. In these sequences the events are located at consecutive timestamps
(i.e., 1, 2, 3, ...) and each sequence begins at timestamp 1.

Sequence 1:

{A}, {A}, {A}, {A}, {A}, {B}, {B}, {B}, {B, C}, {B, C}, {B, C}, {B, C},
{B}, {B}, {B}

Sequence 2:

{B}, {A, B}, {A, B}, {B}, {B, C}, {B, C}, {B, C}, {B, C}, {B, C}, {B, C},
{C}, {C}, {C}, {C}

Sequence 3:

{}, {A}, {}, {B}, {B}, {B}, {B}, {B, C}, {B, C}, {C}, {C}, {C}, {C}, {C}

A classical representation of occurrence lists, like the one used by cSpade [13],
is depicted in Figure 2, in the left-hand table of each of the three areas. These
tables represent the occurrence lists of cSpade for patterns A, B, C, A → B, A → C and
A → B → C, with supMin = 2, gapMin = 2, gapMax = 5 and winMax = 10.
In the tables, the column sid corresponds to the identifier of the sequence in
which the pattern occurs, eid corresponds to the timestamp of the last event of
this occurrence, and diff corresponds to the difference between the timestamps
of the first and the last event of the occurrence (used by cSpade to check the
time window constraint).

We propose a notion of constrained generalized occurrence (generalized occurrence
for short) to compact such consecutive occurrences. This notion is straightforward
for patterns of size 1, but less so for longer patterns, since it has
to enable the handling of the various constraints. For a pattern z, a
generalized occurrence has the form (sid, tBeg, [min, max], gmax) and contains:

- An identifier sid: the identifier of a sequence where pattern z occurs.

- A timestamp tBeg: the timestamp of an occurrence of
  the first event of the pattern z (the detailed construction of tBeg is
  given in Algorithm 2).

- An interval [min, max] corresponding to the eids of consecutive occurrences of
  the last event of pattern z.

- A value gmax that indicates the timestamp of the last occurrence of the last
  event of pattern z respecting the gapMax constraint. If no such occurrence
  exists then gmax is set to -1.

Examples of generalized occurrences for the toy database are given in Figure 2,
in the right-hand table of each of the three areas. In the case of pattern B, it is
possible to reduce its 10 consecutive occurrences in the first sequence to a single
generalized occurrence (1, 6, [6, 15], 15), where the interval [6, 15] compacts all 10
eids. Note that for patterns of size 1 the fields tBeg and gmax
are useless. However, this is not the case for longer patterns. For example, let
us consider the last generalized occurrence of the constrained generalized
occurrence list of pattern A → B. This generalized occurrence is (3, 2, [4, 9], 7),
indicating that it appears in sequence 3 and starts at timestamp 2. The interval
[4, 9] means that it represents several occurrences ending from 4 to 9. The gmax
value of 7 indicates that occurrences ending from 4 to 7 satisfy the gapMax
constraint, while for occurrences ending strictly after timestamp 7 only the prefix
of the occurrence satisfies gapMax.
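For patterns of size 1, the compaction amounts to grouping runs of consecutive timestamps. The following Python sketch illustrates this; the function name is hypothetical, and setting tBeg to the start of the run and gmax to its end is an assumption taken from the example (1, 6, [6, 15], 15) above.

```python
def generalized_occurrences(sid, eids):
    """Compact the timestamps of a 1-pattern in sequence `sid` into tuples
    (sid, tBeg, (min, max), gmax), one per run of consecutive timestamps."""
    result = []
    run_start = prev = None
    for eid in sorted(set(eids)):
        if prev is not None and eid == prev + 1:
            prev = eid                      # the current run continues
            continue
        if prev is not None:                # close the previous run
            result.append((sid, run_start, (run_start, prev), prev))
        run_start = prev = eid              # open a new run
    if prev is not None:                    # close the last run
        result.append((sid, run_start, (run_start, prev), prev))
    return result
```

Applied to the ten occurrences of B in the first sequence (timestamps 6 to 15), this produces a single tuple instead of ten list entries.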

When a generalized occurrence represents no occurrence that satisfies the
gapMax constraint for all of its events, but only occurrences that satisfy this
constraint up to, but not including, the last event, its gmax value is set to -1
(as for example in the generalized occurrence (3, 2, [8, 12], -1) of pattern
A → C in Figure 2).

Fig. 2. Occurrence lists vs. generalized occurrence lists for patterns A, B, C, A → B,
A → C and A → B → C, with supMin = 2, gapMin = 2, gapMax = 5 and winMax = 10.

3.2 Dedicated Join Algorithm 

The GoSpec algorithm is an instance of the abstract Algorithm 1 that uses a join
designed for generalized occurrence lists.

The join process is called once the merge operation has been done. It computes
the constrained generalized occurrence list of a candidate pattern z from
the occLists of two generator patterns $z_1$ and $z_2$ having the same prefix.

Two different procedures are called depending on the level of the extraction
process, JoinLevel2 (Algorithm 4) and Join (Algorithm 3). The first one is
dedicated to the particular case of a 2-pattern candidate
and the second one to the general case of a k-pattern candidate with k > 2.
Both use a common function, LocalJoin (Algorithm 2), that
computes a generalized occurrence v = (sid, tBeg, [min, max], gmax) of z from
a single generalized occurrence of $z_1$ and a single generalized occurrence of $z_2$.

LocalJoin (Algorithm 2) first verifies that the input generalized occurrences
satisfy the necessary conditions to be joined, performing the tests of line
3, and that the two generalized occurrences come from the same sequence, that is
$sid_1 = sid_2$ (line 4). On line 3, the first comparison verifies that there exists
at least one suffix of an instance of $(sid_2, tBeg_2, [min_2, max_2], gmax_2)$ that
follows the first suffix of an instance of $(sid_1, tBeg_1, [min_1, max_1], gmax_1)$ and that
satisfies the gapMin constraint. The second comparison checks that there exists
at least one suffix of an instance of $(sid_2, tBeg_2, [min_2, max_2], gmax_2)$ that
satisfies the winMax constraint w.r.t. $tBeg_1$. The last comparison ensures that
Algorithm 2 (LocalJoin)

Input: two generalized occurrences
  (sid_1, tBeg_1, [min_1, max_1], gmax_1)
  and (sid_2, tBeg_2, [min_2, max_2], gmax_2)

Output: (v, add), where v = (sid, tBeg, [min, max], gmax) and add is
  a boolean value that is false if v cannot be created.

 1. let add := false
 2. let v := null
 3. if (min_1 + gapMin ≤ max_2) and (tBeg_1 + winMax ≥ min_2)
       and (min_1 ≤ gmax_1) then
 4.   if (sid_1 = sid_2) then
 5.     let sid := sid_1
 6.     let tBeg := tBeg_1
 7.     find min, the minimum element x of [min_2, max_2]
          such that x ≥ min_1 + gapMin
 8.     find max, the maximum element x of [min_2, max_2]
          such that x ≤ tBeg_1 + winMax
 9.     find gmax, the maximum element x of [min_2, max_2]
          such that x ≤ gmax_1 + gapMax
10.   fi
11.   if (min and max exist) and (min ≤ max) then
12.     if (gmax does not exist) then let gmax := -1
13.     else if (gmax > max) then
14.       let gmax := max fi
15.     fi
16.     let v := (sid, tBeg, [min, max], gmax)
17.     let add := true
18.   fi
19. fi
20. output (v, add)



Algorithm 3 (Join)

Input: occList(z_1) and occList(z_2), generalized occurrence lists
  of two patterns that share the same prefix.

Used subprograms: Algorithm 2

Output: a new occList

Initialize occList to the empty list.

1. for all occ_1 ∈ occList(z_1) do
2.   for all occ_2 ∈ occList(z_2) do
3.     let (v, add) := LocalJoin(occ_1, occ_2)
4.     if add then
5.       Insert v in occList
6.     fi
7.   od
8. od
9. output occList

Fig. 3. LocalJoin and Join algorithms.
$(sid_1, tBeg_1, [min_1, max_1], gmax_1)$ has at least one instance that satisfies the
gapMax constraint.

Lines 5 to 9 generate a new generalized occurrence. min is the timestamp
of the earliest suffix of an instance of $(sid_2, tBeg_2, [min_2, max_2], gmax_2)$ that
follows the earliest suffix of an instance of $(sid_1, tBeg_1, [min_1, max_1], gmax_1)$ and
that verifies the minimum gap constraint. In the same way, max is the timestamp of
the latest suffix of an instance of $(sid_2, tBeg_2, [min_2, max_2], gmax_2)$ that verifies
the time window constraint w.r.t. $tBeg_1$. gmax indicates the timestamp of the
latest suffix of an instance of $(sid_2, tBeg_2, [min_2, max_2], gmax_2)$ that can form
an occurrence of v that verifies gapMax.
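A direct Python transcription of LocalJoin may help to follow the line-by-line description. It is a sketch under stated assumptions: the `GenOcc` type and the keyword names are illustrative, comparisons are taken as inclusive (consistent with the example (3, 2, [4, 9], 7) of Section 3.1), and the function returns `None` instead of the pair (v, add) when no generalized occurrence can be built.

```python
from typing import NamedTuple, Optional


class GenOcc(NamedTuple):
    sid: int
    t_beg: int
    lo: int      # "min" of the interval
    hi: int      # "max" of the interval
    gmax: int    # -1 when no end satisfies gapMax


def local_join(o1: GenOcc, o2: GenOcc, gap_min: int, gap_max: int,
               win_max: int) -> Optional[GenOcc]:
    # Line 3: necessary conditions on gapMin, winMax and gapMax.
    if not (o1.lo + gap_min <= o2.hi
            and o1.t_beg + win_max >= o2.lo
            and o1.lo <= o1.gmax):
        return None
    # Line 4: both occurrences must come from the same sequence.
    if o1.sid != o2.sid:
        return None
    # Lines 7-8: clip the interval of o2 with gapMin and winMax.
    lo = max(o2.lo, o1.lo + gap_min)
    hi = min(o2.hi, o1.t_beg + win_max)
    if lo > hi:                       # line 11
        return None
    # Line 9: latest end in [min_2, max_2] that still respects gapMax.
    candidate = min(o2.hi, o1.gmax + gap_max)
    gmax = candidate if candidate >= o2.lo else -1   # line 12
    if gmax > hi:                     # lines 13-14
        gmax = hi
    return GenOcc(o1.sid, o1.t_beg, lo, hi, gmax)
```

With gapMin = 2, gapMax = 5 and winMax = 10, joining the occurrence of A at timestamp 2 in sequence 3 with the occurrences of B at [4, 9] gives (3, 2, [4, 9], 7), and joining it with the occurrences of C at [8, 14] gives (3, 2, [8, 12], -1), the two values quoted in Section 3.1.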

This LocalJoin algorithm is called by Join (Algorithm 3), which generates a
new occList from the occLists of two generator patterns $z_1$ and $z_2$. Algorithm 3
iterates over the elements of $occList(z_1)$ and $occList(z_2)$. For each pair
$(occ_1, occ_2)$ a new constrained generalized occurrence is generated when possible
using LocalJoin. Algorithm 3 is the general join operation used for k-patterns
when k > 2. A dedicated join is needed to generate the occurrence lists of
2-patterns (i.e., when $z_1$ and $z_2$ contain a single item). It is called JoinLevel2 and is
presented as Algorithm 4. Contrary to the general Join, JoinLevel2 performs
several calls to the LocalJoin procedure for each pair of generalized occurrences.
Indeed, the instances of the generalized occurrence
$(sid_1, tBeg_1, [min_1, max_1], gmax_1)$ must be processed separately
because they correspond, in the data, to different starting timestamps
of the 1-pattern $z_1$. Thus, several calls are made on all generalized occurrences
$(sid_1, p, [p, p], p)$ with p varying between the values $min_1$ and $max_1$.

Proofs of the correctness of the representation using generalized occurrences
(and the corresponding join process) can be found in [6].
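The two join procedures can be sketched in Python as follows. This is an illustration rather than the authors' implementation: generalized occurrences are plain tuples (sid, tBeg, min, max, gmax), the function names are hypothetical, bounds are assumed inclusive, and a condensed LocalJoin is included only so that the block runs on its own.

```python
def local_join(o1, o2, gap_min, gap_max, win_max):
    """Condensed LocalJoin; returns None when no occurrence can be built."""
    sid1, t1, lo1, _hi1, g1 = o1
    sid2, _t2, lo2, hi2, _g2 = o2
    lo, hi = max(lo2, lo1 + gap_min), min(hi2, t1 + win_max)
    if sid1 != sid2 or lo1 > g1 or lo > hi:
        return None
    g = min(hi, g1 + gap_max)
    return (sid1, t1, lo, hi, g if g >= lo2 else -1)


def join(occ_list1, occ_list2, **constraints):
    """General case (k > 2): one LocalJoin call per pair of occurrences."""
    out = []
    for o1 in occ_list1:
        for o2 in occ_list2:
            v = local_join(o1, o2, **constraints)
            if v is not None:
                out.append(v)
    return out


def join_level2(occ_list1, occ_list2, **constraints):
    """2-pattern case: every start timestamp p of z1 is processed separately."""
    out = []
    for sid1, _t, lo1, hi1, _g in occ_list1:
        for o2 in occ_list2:
            for p in range(lo1, hi1 + 1):
                v = local_join((sid1, p, p, hi1, p), o2, **constraints)
                if v is not None:
                    out.append(v)
    return out
```

On the toy database, with the generalized occurrence lists of A, B and C written by hand, `join_level2` yields (3, 2, 4, 9, 7) as the last entry for A → B and (3, 2, 8, 12, -1) as the last entry for A → C, matching the values given in Section 3.1; `join` then combines these two lists into the list of A → B → C.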



Algorithm 4 (JoinLevel2)

Input: occList(z_1), occList(z_2)

Used subprograms: Algorithm 2

Output: a new occList

Initialize occList to the empty list.

 1. for all (sid_1, tBeg_1, [min_1, max_1], gmax_1) ∈ occList(z_1) do
 2.   for all (sid_2, tBeg_2, [min_2, max_2], gmax_2) ∈ occList(z_2) do
 3.     for all p ∈ [min_1, max_1] do
 4.       let (v, add) := LocalJoin((sid_1, p, [p, max_1], p),
                                    (sid_2, tBeg_2, [min_2, max_2], gmax_2))
 5.       if add then
 6.         Insert v in occList
 7.       fi
 8.     od
 9.   od
10. od
11. output occList

Fig. 4. JoinLevel2 algorithm.
4 Experimental Results

In this section, we present experimental results and compare the behaviors of
GoSpec and of cSpade [13] (one of the most efficient algorithms proposed in the
literature and based on occurrence lists).

Both algorithms have been implemented using Microsoft Visual C++ 6.0,
with the same kind of low-level optimization to allow a fair comparison. All
experiments have been performed on a PC with 196 MB of memory and a 500
MHz Pentium III processor under Microsoft Windows 2000.

4.1 Experiments on Synthetic Datasets 

The synthetic dataset has been generated using the Dataquest generator of
IBM [1] with the following parameters: C10-T2.5-S4-I1.25-D1K over an alphabet
of 100 items (called set1). It contains 1000 sequences with an average size of
10 events per sequence (see [1] for more details on the generator parameters).
In this dataset, the time interval between two timestamps is 1, and there is one
event per timestamp.

In order to have datasets presenting parameterized consecutive repetitions of
certain items, we post-processed set1 to add such repetitions.
Each item found in an event of a sequence has a probability fixed to 10% of
being repeated. When an item is repeated, we simply duplicate it in the next i
consecutive events. If the end of the sequence is reached during the duplication
process, the sequence is not extended (no new event is created) and thus the
current item is not completely duplicated. We denote set1_r{i} the dataset
obtained with a repetition parameter of value i. For the sake of uniformity, set1 is
denoted set1_r0. The post-processing of set1_r0 leads to the creation of 5 new
datasets set1_r1, ..., set1_r5.
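The post-processing step can be sketched in Python as follows. The function name and signature are hypothetical, and the sketch assumes that only items of the original sequence are candidates for repetition (copies created by the post-processing are not repeated again), a point the description leaves open.

```python
import random


def add_repetitions(sequence, i, prob=0.10, rng=None):
    """sequence: list of events (sets of items). Each original item is, with
    probability `prob`, duplicated into the next `i` consecutive events.
    The sequence is never extended. Returns a new list of sets."""
    rng = rng or random.Random()
    out = [set(event) for event in sequence]
    for pos, event in enumerate(sequence):      # original items only
        for item in sorted(event):              # fixed order for reproducibility
            if rng.random() < prob:
                last = min(pos + i, len(out) - 1)   # stop at the end of the sequence
                for nxt in range(pos + 1, last + 1):
                    out[nxt].add(item)
    return out
```

Calling it with i = 1, ..., 5 on every sequence of set1 would produce datasets analogous to set1_r1, ..., set1_r5.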

The first three graphs (top-left, top-right and middle-left) of Figure 5 show
the results of the extractions performed on datasets set1_r1, set1_r2, ..., set1_r5
with the following constraints: a support threshold fixed to 2.5%, a time window
limited to 6, a minimum gap fixed to 2 and a maximum gap fixed to 4.

The top-left graph (Figure 5) gives the size of the cSpade and GoSpec occurrence
lists (in number of elements) for extractions performed on files set1_r1, ...,
set1_r5. As expected, the total number of occurrences used by cSpade is greater
than the number of constrained generalized occurrences used by GoSpec, and
this reduction increases with the number of repetitions. The top-right graph
shows that this reduction has a direct impact on the join costs (in terms of
number of calls to LocalJoin), which results in a significant reduction of the total
execution time of the extractions, as shown in the middle-left graph of Figure 5.

The middle-right graph of Figure 5 completes these results with the extraction
times on datasets set1_r0 and set1_r5. It shows that, for GoSpec, the execution time
needed to find a given number of patterns remains almost the same in the presence
of repetitions.

4.2 Experiments on Real Datasets 

The first real dataset is a financial dataset provided by the CDC financial
company (Caisse des Dépôts et Consignations) and contains the variations of stock
prices over one year. The discretized data result in a set (called set2) of 2830
sequences with an average length of 15 events per sequence. These sequences have
been built from an alphabet of 17 items. The extractions have been performed
using the extended version of the algorithm [6], that is, without any limitation
on the number of items per event composing the generated patterns. The following
constraints have been used: winMax = 10, gapMax = 4 and gapMin = 2.
The bottom-left graph of Figure 5 represents the total execution time of both
cSpade and GoSpec for minimum support thresholds varying from 25% to 50%
and shows that GoSpec offers a significant gain w.r.t. cSpade.

The second real dataset is a set of DNA sequences called
set3. It contains 1778 sequences with an average length of 102 events, with
only one item per event, over the nucleic alphabet {A, T, G, C}. The extractions
have been performed using a time window constraint set to 6, a maximum gap
constraint of 3, no minimum gap constraint, and a minimum support threshold
varying from 5% to 25%. The bottom-left graph of Figure 5 illustrates the total
execution time used by the extractions and shows the advantages of GoSpec in
practice on this second kind of data.

5 Conclusion 

In this paper we presented an algorithm that efficiently handles the
constraint-based mining task when the sequential databases contain consecutive
repetitions of their items. Such a situation can appear in several domains (e.g.,
discretized quantitative time series and DNA sequences). It can cause an
explosion of the number of pattern occurrences and thus an important loss of
efficiency for algorithms based on an occurrence list approach (e.g., [8,12,14,3,13]),
although this algorithm family has proved valuable in many situations (e.g., low
support mining and active constraint handling). The algorithm presented in this
paper extends this family to tackle these domains. It is based on the notion
of constrained generalized occurrences, which compact
several consecutive occurrences of patterns while keeping enough information for
a constraint-based mining process. We showed by means of experiments that
the gain in terms of memory space and execution time is important and that
it increases with the number of consecutive repetitions contained in the input
sequences.
Abstract. We introduce the subspace difference metric, a novel heterogeneous
distance metric for calculating distances between points with
both continuous and (unordered) categorical attributes. Our approach 
is based on the computation and comparison of characteristic subspaces 
(i.e. contexts) for each of the symbols and can be viewed as a general- 
ization of the well-known value difference metric. 

Subsequently, as one possible extension, we propose a linearization of 
the computed symbolic distances by multidimensional scaling, thereby
mapping a set of symbols onto the interval [0, 1]. Thus, even algorithms, 
which have originally been designed for usage with continuous attributes 
(e.g. clustering algorithms like k-means), may be applied to datasets 
containing discrete attributes, without having to adapt the algorithm 
itself. 

Finally, we evaluate the proposed metric and the linearization in quan- 
titative and qualitative settings and exemplify the applicability in clus- 
tering domains. 



1 Introduction and Motivation 

Many inductive algorithms in machine learning and data mining make strict as- 
sumptions on the attribute types of the database. On the one hand, algorithms 
for dealing with continuous data may naturally utilise nearness measurements 
and exploit the metric properties of the instance space. On the other hand, 
however, restricting a knowledge structure to categorical data allows for more 
“exact” induction methods (like association rule or functional dependency min- 
ing), because, intuitively, categorical data does not contain that kind of inherent 
“fuzziness” which continuous data usually exhibits. 

Many learning algorithms, most notably instance based techniques, neural 
networks and some of the most widely used clustering methods, necessitate all 
attribute types to be continuous - in these cases categorical or nominal data or 
missing values often cannot be handled appropriately. To overcome these
difficulties, in the context of classification learning, the value difference metric (VDM,
refer to [SW86] or [CS93]) or one of its generalizations [WM97] has been used
to good effect. Furthermore some interesting approaches for clustering heteroge- 
neous data (i.e. datasets with mixed continuous and categorical attributes) have 
been published in the recent years. 
Some of these novel approaches can be regarded as self-contained methods
for clustering: e.g., [GKR00] and [ZFCH00] describe iterative clustering
approaches based on dynamical systems, [SCC00] use a generalized notion of
entropy, [GRS00] propose a concept of links to measure similarity and [GGR99]
introduce a summarization-based algorithm. Additionally, extensions for some
well-known clustering algorithms have been developed: e.g., [NH98] describe
incremental and sparse variants of the EM algorithm and [GRB99] develop a
discrete KMeans algorithm.

Whereas the problem of transforming continuous attributes to discrete types
for the application of symbolic algorithms is well-known and usually termed
discretization¹, none of the aforementioned methods solves the inverse problem:
transforming discrete symbols to ordered (continuous or nominal) types. They
either represent complete self-contained clustering procedures with little or no
affinity to any continuous clustering scheme or require a rewriting of the distance
metric used by the clusterer.

In this paper we propose the heterogeneous subspace difference metric 
(HSDM), a novel heterogeneous distance metric for computing distances be- 
tween points with both continuous and categorical attributes. The basic idea 
is the computation and comparison of characteristic subspaces for each of the 
symbols; thus, the HSDM can be viewed as a generalization of the HVDM
(heterogeneous value difference metric).

Furthermore, for usage as a pre-processing step, we propose a linearized ex- 
tension of the SDM component, the linearized subspace difference metric (ISDM), 
which induces a strict ordering of the involved symbols. For this task we make 
use of multidimensional scaling (MDS) to map a set of symbols onto a one- 
dimensional scale, given high-dimensional distance measurements calculated by 
the SDM. Thus, even algorithms, which have originally been designed for us- 
age with continuous attributes, may be applied to datasets containing discrete 
attributes, without having to adapt the algorithm itself. 

2 Categorical Metrics 

As [GGR99] note, distance functions on categorical attributes are not naturally 
defined, because it is difficult to reason that, e.g., “one color is ’like’ or ’un- 
like’ another color in a way similar to real numbers.” This is due to the fact 
that (unordered) categorical attributes typically do not contain any information 
other than the symbols themselves. Whereas with continuous numbers various 
calculations and pairwise comparisons can be performed, usually all we can say 
about two colors, is whether they are equal or not. 



¹ Some induction algorithms can be viewed as being able to discretize “on the fly” (e.g.
C4.5); however, several methods for automatic pre-discretization have also appeared
in print: [DKS95] give an excellent overview, while [Lud00] present an unsupervised
multivariate approach.
2.1 Overlap and VDM

This, then, is also the most widely used (e.g. [AKA91]) method for comparing
two symbols. For calculating the distance between two instances with mixed
continuous and categorical values, the following heterogeneous Euclidean overlap
metric (HEOM, refer to [WM97]) is used:



$$HEOM(x, y) = \sqrt{\sum_{i=1}^{m} d_i(x_i, y_i)^2}$$



Here, x and y are instance vectors, m is the number of attributes and di is the 
following function: 



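$$d_i(x, y) = \begin{cases} 1 & \text{if } x \text{ or } y \text{ is missing} \\ overlap(x, y) & \text{if attribute } A_i \text{ is categorical} \\ \dfrac{|x - y|}{range_i} & \text{otherwise} \end{cases}$$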



The following simple overlap function is used: 



$$overlap(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{otherwise} \end{cases}$$



Clearly, the HEOM is overly simplistic in handling categorical attributes and al- 
though it may be appropriate in some cases, its use can lead to poor performance 
[CS93]. 
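The HEOM definition translates almost literally into code. The Python sketch below is illustrative only; the argument layout (`categorical` flags, `ranges` per attribute, `None` for a missing value) is an assumption made for the example.

```python
import math


def overlap(x, y):
    """0 if the two symbols are equal, 1 otherwise."""
    return 0.0 if x == y else 1.0


def heom(x, y, categorical, ranges):
    """x, y: equal-length instance vectors; None marks a missing value.
    categorical[i] is True if attribute i is categorical;
    ranges[i] is max - min of continuous attribute i (unused otherwise)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        if xi is None or yi is None:
            d = 1.0                          # missing value
        elif categorical[i]:
            d = overlap(xi, yi)              # categorical attribute
        else:
            d = abs(xi - yi) / ranges[i]     # range-normalized difference
        total += d * d
    return math.sqrt(total)
```

Note that any two distinct symbols are at distance 1 from each other, regardless of how they are distributed in the data.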

A more sophisticated alternative, the value difference metric (VDM), intro- 
duced by [SW86], in most cases provides a better distance measurement for 
categorical attributes: Basically, it consists of considering two symbols to be 
similar, if they make similar predictions. The following definition is a simplified 
version without weighting terms [Dom96] : 



$$SVDM(x, y) = \sum_{i=1}^{c} |p(c_i|x) - p(c_i|y)|^q$$

where c is the number of classes, $p(c_i|x)$ is the conditional probability that the
output class is $c_i$ given the input value x and q is a constant (usually 1 or 2). This
categorical distance function can then be used as a replacement for the overlap
function, yielding the heterogeneous value difference metric (HVDM, refer to
[WM97]).

As [Dom96] note, the SVDM “attenuates [the problem of sensitivity to irrelevant
attributes] for symbolic attributes, as long as a large number of examples
is available, [...] due to the fact that by definition $p(c_i|x_j)$ will be roughly the
same for all values $x_j$ of an irrelevant attribute, leading to zero distance between
them.” However, the SVDM is obviously only applicable in cases where class
values are available for all instances (e.g. classification tasks).
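As a small illustration, the simplified VDM can be computed from class-conditional frequencies as in the Python sketch below. The helper names and the toy data are hypothetical; probabilities are estimated by simple relative frequencies.

```python
from collections import Counter, defaultdict


def class_probabilities(values, labels):
    """Estimate p(c | x) for every symbol x of one categorical attribute."""
    counts = defaultdict(Counter)
    for v, c in zip(values, labels):
        counts[v][c] += 1
    return {v: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for v, cnt in counts.items()}


def svdm(x, y, probs, classes, q=2):
    """Simplified value difference metric between symbols x and y."""
    return sum(abs(probs[x].get(c, 0.0) - probs[y].get(c, 0.0)) ** q
               for c in classes)
```

For example, if red always predicts the class "warm", blue always predicts "cold" and orange predicts each class half of the time, then with q = 1 the distance between red and orange is 1 and the distance between red and blue is 2.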
2.2 The Subspace Difference Metric (SDM)

In situations, where a class attribute is not readily available (e.g. unsupervised 
learning), the VDM obviously cannot be used by definition. In such cases one 
might be tempted to fall back on the simpler overlap function to compute hetero- 
geneous distances. With datasets, where the majority of attributes is continuous, 
this procedure might be appropriate - or at least not too harmful -, but when 
many or all of the attributes are of categorical type, much information about 
the distribution of the symbols will be lost thereby. 

Rethinking the first paragraph in section 2, we realize that it is not exactly 
true. Actually, we do have more information about the symbols of a categorical 
attribute: we know (or can compute) the distribution of each symbol within 
the instance space, i.e. we know what values in other attributes each symbol 
co-occurs with. And this is exactly the information that we need to be able to 
argue that, say, the color red is more similar to orange than to blue. 

Intuitively, to be able to argue about the pairwise similarity of two symbols, 
we have to make the assumption that the symbols do not exactly partition the 
domain space of another attribute. E.g. to be able to say that red is more similar 
to orange than to blue, we could argue that there are more instances in the 
dataset which can be red or orange than there are instances which can be red 
or blue. I.e. the context, in which the color red occurs overlaps more with the 
context of the color orange than with that of the color blue. 



Characteristic Subspaces. The ideal tool for computing the overlaps be- 
tween such contexts would be characteristic rules or the characteristic subspaces 
induced by them. Formally: 

Definition 1. Let $A_i$ represent the i-th attribute ($i \in \{1, \dots, m\}$) and $V_i$ be a set
of values from the domain of attribute i. A characteristic rule is an implication
rule of the form

$$A_i = x \rightarrow A_1 \in V_1 \wedge \dots \wedge A_m \in V_m$$

where the attribute $A_i$ does not occur on the right hand side. A characteristic
rule is said to hold with confidence p, formally

$$A_i = x \rightarrow_p A_1 \in V_1 \wedge \dots \wedge A_m \in V_m$$

if with probability at least p, whenever condition $A_i = x$ holds for an instance,
the right hand side holds as well.

A characteristic rule is basically an association rule, where the left hand side is
restricted to only one condition. $V_i$ is a discrete set of values, if $A_i$ is categorical,
and a continuous interval, if $A_i$ is continuous.

Usually characteristic rules are constructed for values of a designated class
attribute; we may, however, induce such rules for symbols from any categorical
attribute. Thereby we obtain information about the context in which a certain
symbol occurs. The subspaces induced by these rules will in the following be
called characteristic subspaces. Intuitively, a characteristic subspace gives an
extensional description of a symbol and by comparing these spaces we could argue
about the similarity of the symbols.



Projected Characteristic Subspaces. Unfortunately, computing such
characteristic subspaces with high confidences would necessitate running a
characteristic rule induction algorithm for each single symbol, which is obviously
computationally infeasible. For rules with high confidence it would, e.g., not suffice
to compute the projections of a symbol onto each of the other attributes and
combine them, because the cartesian product of two or more highly probable
regions in different attributes need not necessarily be highly probable as well,
formally:

$$A_i = x \rightarrow_p (A_j \in V_j),\; A_i = x \rightarrow_p (A_k \in V_k) \;\not\Rightarrow\; A_i = x \rightarrow_p (A_j \in V_j \wedge A_k \in V_k)$$

However, in our case, we may relax the requirements, in that we actually need
not construct full characteristic subspaces. For the comparison of two symbols
it suffices to compare the projections onto each of the other attributes. We
call the resulting spaces projected characteristic subspaces (pc-subspaces).

Definition 2. With $D_i$ representing the domain of attribute $A_i$, let the discrete
domain $\hat{D}_i$ be defined as follows:

$$\hat{D}_i = \begin{cases} D_i & \text{if attribute } A_i \text{ is categorical} \\ \{1, \dots, d\} & \text{if attribute } A_i \text{ is continuous, for } d \in \mathbb{N} \end{cases}$$

To construct the discrete domains of the attributes, we have to discretize the
continuous attributes by some simple unsupervised discretization method.
Preferably we should use equal-width discretization for this task, because this method
reflects the distribution of the values within the interval (in the experiments
we chose d = 6 as standard setting).
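Equal-width discretization onto {1, ..., d} can be sketched in a few lines of Python. The function name is hypothetical, and mapping the maximum value into the last bin (rather than a bin d + 1) is a convention assumed here.

```python
def equal_width_bins(values, d=6):
    """Map continuous values to bin indices 1..d using bins of equal width
    over [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # degenerate attribute: a single bin
        return [1 for _ in values]
    width = (hi - lo) / d
    # The maximum value falls on the upper edge and is assigned to bin d.
    return [min(int((v - lo) / width) + 1, d) for v in values]
```

With d = 6 and values ranging from 0 to 6, the values 0, 3 and 6 are mapped to the bins 1, 4 and 6.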

Definition 3. Let $p_j$ be the discrete probability density of the values from the
discrete domain $\hat{D}_j$ of attribute j and let $p_{j,A_i=x}$ be this discrete probability
density under the condition $A_i = x$ (i.e. $p_{j,A_i=x}(y) = p_j(y \mid A_i = x)$). Then the
probabilistic projected characteristic subspace (ppc-subspace) of the symbol x in
attribute $A_i$ is defined as the collection of all densities $p_{j,A_i=x}$ for $j \neq i$.



Definition 4. With m being the number of attributes, x and y symbols from
a categorical attribute $A_i$ and $q \in \mathbb{N}$, the subspace difference metric (SDM) is
defined as follows:

$$SDM_i(x, y) = \sum_{j=1,\, j \neq i}^{m} \sum_{v \in \hat{D}_j} |p_{j,A_i=x}(v) - p_{j,A_i=y}(v)|^q$$
For convenience and ease of computation, the ppc-subspace of a symbol can be
written as a matrix of dimensions $m \times s$, with m being the number of attributes
and $s := \max\{|\hat{D}_1|, \dots, |\hat{D}_m|\}$. The entries of this matrix are the respective
conditional probabilities; columns having fewer than s symbols are filled with zeros
up to s elements. That way, the calculation of a symbolic distance can be seen
as a matrix operation.
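The following Python sketch illustrates how ppc-subspaces and the SDM could be computed. It is a sketch under assumptions rather than the authors' implementation: rows are tuples of already discretized values, the conditional densities are estimated by relative frequencies accumulated in dictionaries during a single scan, the attribute $A_i$ itself is excluded as in Definition 3, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def ppc_subspaces(rows, i):
    """rows: list of tuples of discrete values; i: index of a categorical attribute.
    Returns {symbol: {j: {value: P(A_j = value | A_i = symbol)}}} for j != i."""
    counts = defaultdict(lambda: defaultdict(Counter))
    totals = Counter()
    for row in rows:
        x = row[i]
        totals[x] += 1
        for j, v in enumerate(row):
            if j != i:
                counts[x][j][v] += 1     # co-occurrence of x with value v of A_j
    return {x: {j: {v: n / totals[x] for v, n in cnt.items()}
                for j, cnt in per_attr.items()}
            for x, per_attr in counts.items()}


def sdm(x, y, subspaces, q=1):
    """Sum of |p_x(v) - p_y(v)|^q over all other attributes j and values v."""
    total = 0.0
    for j in subspaces[x].keys() | subspaces[y].keys():
        px = subspaces[x].get(j, {})
        py = subspaces[y].get(j, {})
        for v in px.keys() | py.keys():
            total += abs(px.get(v, 0.0) - py.get(v, 0.0)) ** q
    return total
```

On a toy dataset where red and orange co-occur with similar values of a second attribute while blue does not, the SDM between red and orange is smaller than the one between red and blue, which is the behaviour motivated at the beginning of this section.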

Finally, plugging the SDM into the heterogeneous metric, where previously 
the (S)VDM or the overlap function was used, yields the HSDM: 

Definition 5. With m being the number of attributes and $q \in \mathbb{N}$, the heterogeneous
subspace difference metric (HSDM) of two instances x and y is defined
as follows:

$$HSDM(x, y) = \sum_{i=1}^{m} d_i(x_i, y_i)^q$$

with the distance function $d_i$ defined heterogeneously as follows:

$$d_i(x, y) = \begin{cases} 1 & \text{if } x \text{ or } y \text{ is missing} \\ SDM_i(x, y) & \text{if attribute } A_i \text{ is categorical} \\ \dots & \text{otherwise} \end{cases}$$



Complexity. Because the ppc-subspace of a symbol can be written as
a matrix of dimensions $m \times s$ (section 2.2), the comparison of two symbols is linear
in the number of attributes m and in the maximum number of unique symbols s
in any categorical attribute. Thus, the computation of a heterogeneous distance
between two instances is, in the worst case, $O(m^2 s)$. Unfortunately, however,
the explicit pre-computation of the ppc-matrices would take time 0{wfn^) (n 
being the number of instances), yielding 0{n^) complexity for the computation 
of a dissimilarity matrix as well. 

By means of a hashtable, however, many of the computations can be simpli- 
fied. If, e.g., a complete dissimilarity matrix is to be computed, the ppc-subspaces 
of all symbols can be pre-computed by scanning through all instances and recording
each co-occurrence of two symbols x and y in two attributes $a_1$ and $a_2$. That
way, the ppc-subspace of a symbol is only implicitly represented by all associations
(along with their counts) for the according symbol stored in the hashtable.
This pre-computation step basically consists of linearly scanning the database 
and thus can be realized in time 

3 Linearization by Multidimensional Scaling 

One possibility for modularly implementing the SDM in a learning algorithm is 
the HSDM as defined above. However, this necessitates (at least partly) a rewrit- 
ing of the involved algorithm. Unfortunately, this is not always possible. Also, it 
may prove advantageous for subsequent quantitative data mining techniques to 
have available nearness information between the symbolic values.

A transformation of the categorical attributes to continuous types in a preprocessing
step could thus be a possible solution. However, we have to be aware
of the fact that this task can not always be performed well and that in some 
cases the simple overlap metric might indeed be more appropriate. 

We now extend the SDM to a linearized version, the linearized subspace dif- 
ference metric (ISDM). We accomplish this by applying multidimensional scaling 
(MDS) methods to the distance measurements induced by the SDM. 

MDS is typically used for computing representative data points for high- 
dimensional data (which should, e.g., be visualized) or proximity data (some- 
times even incomplete) in a suitable low-dimensional space, such that the dis- 
tances between the projected data points match the original distance values as 
faithfully as possible. The basic idea consists of minimizing a cost function, usually
stress, raw stress, strain or something similar. The original algorithm for
minimizing stress can be found in [Kru64]; Sammon mapping, also one of the
oldest approaches, can be found in [Sam69]. [dLJ77] describe the widely used
majorization method for MDS and [KB97] present an application of deterministic
annealing to the problem.

In our case the task can be defined as projecting data points of $m \times s$ dimensions
(see section 2.2) onto a continuous scale, i.e. one dimension. For this
projection we applied a simple gradient descent approach, minimizing raw stress.

The usual problem of finding good initial configurations applies here as well: 
Applying a strict gradient descent algorithm to a one-dimensional configuration 
can only move the points around a bit, but cannot change their relative order. 
However, because the target space is only one-dimensional, we have available 
a few canonical configurations, which should work reasonably well as starting 
points: 

For a categorical attribute $A_i$ we construct $|D_i|$ different initial configurations
by using one of the symbols $x_j$ as the point of origin and placing the other
symbols at positions which conform to their distances from $x_j$. We then run the
gradient descent algorithm on each of the $|D_i|$ configurations and accept the
final configuration with the lowest raw stress as the resulting projection.
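
The procedure above can be sketched as follows. This is a minimal illustration, assuming raw stress is the sum of squared differences between the projected and the given distances, and using a fixed step size and iteration count; the function names and both parameters are assumptions of this sketch, not the original implementation.

```python
def raw_stress(z, d):
    """Sum over all pairs of (|z_i - z_j| - d_ij)^2."""
    n = len(z)
    return sum((abs(z[i] - z[j]) - d[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def descend(z, d, step=0.01, iters=500):
    """Plain gradient descent on raw stress for a one-dimensional configuration."""
    z = list(z)
    n = len(z)
    for _ in range(iters):
        grad = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i != j:
                    diff = z[i] - z[j]
                    sign = (diff > 0) - (diff < 0)
                    grad[i] += 2.0 * (abs(diff) - d[i][j]) * sign
        z = [zi - step * g for zi, g in zip(z, grad)]
    return z


def linearize(d, step=0.01, iters=500):
    """Try every symbol as the origin and keep the configuration with lowest stress."""
    best = None
    for origin in range(len(d)):
        start = [d[origin][i] for i in range(len(d))]   # canonical initial configuration
        z = descend(start, d, step, iters)
        s = raw_stress(z, d)
        if best is None or s < best[0]:
            best = (s, z)
    return best


# Three symbols whose distances fit a line exactly (positions 0, 1, 3):
distances = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
stress, positions = linearize(distances)
print(stress, positions)
```

Because the descent cannot change the relative order of the points, trying each symbol as the origin is what gives the method a chance to find a good ordering.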



4 Experimental Evaluation 

4.1 Quantitative Results 

In a first set of experiments we exemplify the applicability of our approach in 
a clustering setting: We chose several datasets with varying characteristics from 
the UCI Machine Learning Repository and from the Esprit Project StatLog. 

Apart from hayes-roth and postoperative (patient), all datasets had
both continuous and categorical source attributes. As reference clusterings (for
the calculation of recall) we used the class labels^2; in case of the servo dataset,
where the target variable is continuous, we chose to pre-discretize the target by
equal-width binning into 6 classes (i.e. clusters).

^2 It is reasonable to assume that the existing class labels correlate with well-defined
subspaces within the instance space. There is, however, no theoretical argument
corroborating this assumption.

The synthetic dataset had five attributes: The attributes 1, 3 and 4 were
continuous (ranges [1, 100], [1, 50] and [1, 100], respectively), while the attributes
2 and 5 were symbolic: {A, B} and {w, y, o, r}. The following rules held for the
class attribute: ($a_2 = A \wedge a_{?} > 20 \rightarrow class = 1$), ($a_2 = B \wedge a_{?} < 15 \rightarrow class = 2$),
($a_5 = w \rightarrow class = 3$) and ($else \rightarrow class = 4$).

As a clusterer we used the k-means implementation pam [KR90] , which relies 
on the pre-computation of a dissimilarity matrix before clustering, from the 
cluster package of the freely available and widely used statistics software R. 
We compared three different methods: In the first two runs we used daisy^3 and
the HSDM^4, respectively, for the computation of the dissimilarity matrix before
applying pam. In a third run we used the ISDM to transform the symbols into 
real numbers and applied pam to the continuous dataset. 

For each clustering thus generated we computed two different quality mea- 
sures: On the one hand we used the silhouette coefficient from [KR90] as an 
internal measure. On the other hand we used recall (refer to [LW02]) to compare 
the resulting clustering to the reference clustering. Table 1 shows the results. 



Table 1. Comparative clustering results for various datasets. Shown are the silhouette 
widths (“silh”) and recall values (tolerance = 0.5) of k-means clusterings. The highest 
numbers are printed in bold (no significance test applied). 



dataset         rows  cols  classes   daisy           hsdm            lsdm
                                      silh.  recall   silh.  recall   silh.  recall
postoperative     90     8        3   0.18   0.24     0.18   0.24     0.25   0.24
hayes-roth       132     4        3   0.23   0.00     0.29   0.00     0.33   0.00
servo            167     4        6   0.14   0.12     0.54   0.16     0.38   0.13
synthetic        200     5        4   0.25   0.10     0.40   0.19     0.44   0.05
heart            270    13        2   0.26   0.68     0.24   0.56     0.38   0.40
crx              690    15        2   0.17   0.64     0.18   0.63     0.19   0.66


As can be seen, the HSDM and/or the ISDM in most cases allow for better 
clusterings: In all cases the silhouette width is higher, which is not really surpris- 
ing, given that the HSDM naturally allows for more accurate distance measure- 
ments than the HEOM - a fact that typically has a positive impact on internal 
quality measures. We have illustrated this consideration by 2-dimensional cluster 
plots in figure 1. 



^3 Basically, daisy relies on a slightly more sophisticated HEOM to calculate heterogeneous
distances by using weighting factors for each attribute. daisy is described in
detail in [KR90].

^4 To be able to use the same format for the dissimilarity matrix we implemented the
HSDM in pure R-code (no Fortran), which did not use hashtables (refer to section
2.2). Running times are therefore not comparable.

[Figure 1 panels: Euclidean metric; subspace difference metric]


Fig. 1. Cluster plots (mapped to two dimensions) for the synthetic dataset (see text), 
clustered by k-means (4 centers) in conjunction with the HEOM (daisy) and the HSDM, 
respectively. 



However, in most cases also the recall values of SDM-clusterings increase. 
From these results it is reasonable to assume that a lower clustering quality is 
largely due to an inappropriate handling of symbolic attributes. 

It has to be noted that the HEOM achieved a better recall rating in one
case and did not do much worse in two others. As mentioned in section 3, this
might be due to the fact that in some cases - especially when the symbols have
no inherent canonical order - inducing an artificial order instead of using
0-1-equality might actually be detrimental to clustering quality.



4.2 Qualitative Results 

In a second experiment we wanted to investigate whether the linearized SDM
(ISDM) is able to capture canonical orderings within symbolic attributes, i.e.
whether the concept of characteristic subspaces can indeed reflect intuitively 
obvious orderings and whether these orderings are preserved by scaling the mul- 
tidimensional information down to only one dimension. 

For this test we chose a more complex, but rather small dataset: Pittsburgh 
Bridges (from the UCI repository) is a “design domain” rather than a classification
domain, where 5 design descriptions need to be predicted based on 7
specification properties. The dataset has 108 instances with 13 attributes and
few missing values. We chose to eliminate the two “identifying” attributes (id 
and location). 

In table 2 we list the orderings induced by applying the ISDM to the discretized
version of the dataset. The correct (or intended) orderings of most of
the attributes are known, because they have been discretized from numeric values
- e.g. the attribute erected reflects time epochs from 1818 to 1986. Note
that even the attribute lanes (which contains 4 nominal values) was treated as
discrete in this context.



Table 2. The induced orderings of the symbolic values in the Pittsburgh Bridges 
domain. Only symbolic attributes with a canonical order are shown, the induced values 
for the missing value-symbol are omitted. Note: Some numbers are presented in reverse 
order - this, of course, does not affect the resulting metric. 



attribute   correct order   induced order
erected     crafts          crafts (2.93)
            emerging        emerging (1.28)
            mature          mature (0.64)
            modern          modern (0.00)
length      short           short (1.83)
            medium          long (1.00)
            long            medium (0.00)
lanes       1               1 (4.22)
            2               2 (2.16)
            4               4 (1.09)
            6               6 (0.00)
span        short           short (0.00)
            medium          medium (1.96)
            long            long (2.72)



As can be seen, linearizing the symbols by the ISDM can re-construct the 
former orderings in all but one attribute, where the order of medium and long is 
inverted. Note that we only list the results for attributes with a known canonical 
ordering here and omit results for attributes like, e.g., purpose (with the values 
walk, aqueduct, rr and highway). 

5 Summary and Discussion 

We have presented the novel heterogeneous metric HSDM for computing dis- 
tances between points with categorical or mixed-continuous-categorical attribute 
values. The approach is based on the concept of (probabilistic projected) charac- 
teristic subspaces, which can also be viewed as a generalization of the well-known 
value difference metric. Additionally, we have introduced the idea of linearizing
the high-dimensional distance matrices by multidimensional scaling, thereby
yielding the ISDM, a reverse transformation to discretization: encoding symbols
by continuous numbers.

We have exemplified in a clustering setting that the HSDM most often yields
better results in terms of internal (silhouette width) and external (recall) clustering
quality measures than the widely used HEOM. Finally, in a qualitative
setting, we have shown that the ISDM yields good transformations of the involved
symbols into real numbers, especially when an obvious canonical order
among the symbols exists.

It has to be noted that the transformation does not always produce superior
(quantitative or qualitative) results. Especially in cases where no intuitively obvious
order among the symbols can be found, the naive overlap function yields
slightly better clusterings. This is presumably because in such cases
the computation of 0-1-equality is “as good as it gets” and any artificial ordering
of the symbols might deteriorate the results. A combination of both approaches
in a single metric is one of our future research topics.

Furthermore, the mapping of the high-dimensional subspace data to one con- 
tinuous dimension is obviously a crucial step in our algorithm. Using a more 
sophisticated method for MDS might further improve the final transformation 
and thus is also one of our topics for future research. 

Finally, as one reviewer pointed out, it may be expected that the SDM should 
lend itself well for application in instance based learning tasks (nearest neighbor 
classification). We have chosen to evaluate the proposed metric in a clustering 
setting mainly because - unlike the VDM - it was intended to be used without 
a designated class attribute and will, eventually, be part of a novel clustering 
algorithm. Evaluating the SDM in IBL settings is a valuable suggestion, which 
we plan to tackle in the near future.

Abstract. Many pattern discovery methods provide fast tools for finding
the frequently occurring patterns in large data sets. Such pattern
collections can also be used to approximate the underlying joint distribution,
and they summarize the data set well. However, a large set
of patterns is unintuitive and not necessarily easy to use. In this paper
we consider the problem of ordering a collection of patterns so that
each prefix of the ordering gives as good a summary of the data as possible.
We formulate this problem for general loss functions, show that
the problem has an efficient solution, and prove that its natural variant
is NP-complete but the greedy approximation algorithm gives an
$e/(e-1) \approx 1.58$ approximation quality. We apply the general technique
to approximation of frequencies of frequent sets, and show that
the method gives good empirical results.



1 Introduction 

Many pattern discovery methods provide fast tools for finding the frequently 
occurring patterns in large data sets. However, many methods also result in 
large collections of patterns which are difficult to use. There has been lots of 
work on techniques for pruning the pattern collections without losing too much 
information (see, e.g., [1,2, 3, 4, 5]). 

A collection $S$ of patterns whose frequencies are known can be used to estimate
the frequencies of other patterns in several ways. For example, in frequent
set mining with threshold $\sigma$, if we know the frequencies of AB, AC, and BC
we can estimate the frequency of ABC by at least three methods: by $\sigma/2$, by
the minimum of the frequencies of AB, AC, and BC, or by maximum entropy
methods [6]. Other techniques exist, too, see e.g. [7,8,9].

In this paper we consider the following simple problem: given a collection 
of patterns and an estimation method for the frequencies of unknown patterns, 
how should we sort the known patterns in order of decreasing informativeness 
for the estimation? The solution of this problem gives an ordering such that each 
prefix is as informative as possible with respect to the following patterns. 

We formulate this problem for general pattern classes, estimation methods, 
and loss functions. We show that the problem can be solved efficiently, and 
prove that its natural variant is NP-complete but the greedy method yields
an $e/(e-1)$ approximation algorithm for certain loss functions and estimation
methods, where $e$ is the base of natural logarithms. We apply the general technique
to approximation of frequencies of frequent sets, and show that the method
gives good empirical results.

The rest of this paper is organized as follows. Section 2 gives background on 
the general framework of pattern discovery and on condensed representations. 
Section 3 describes the pattern discovery problem and shows that an optimal solution
for the problem can be used as a good approximation of the pattern collection.
In Section 4 the approximation technique is illustrated with concrete estimation 
methods and loss functions. The technique is experimentally evaluated in Section 
5. Section 6 is a short conclusion. 



2 Background and Related Work 

Pattern discovery, i.e., finding interesting patterns from a data set, is the central
task in data mining [10,11]. The pattern discovery problem can be formulated as
follows: given a pattern collection $\mathcal{P}$ and a quality predicate (or an interestingness
predicate) $q : \mathcal{P} \to \{0, 1\}$, find all interesting patterns, i.e., the patterns $p \in \mathcal{P}$
such that $q(p) = 1$. The predicate is usually defined by using a quality measure
$\phi : \mathcal{P} \to [0, 1]$ and a threshold value $\sigma \in [0, 1]$:

$$q(p) = \begin{cases} 1 & \text{if } \phi(p) \geq \sigma \\ 0 & \text{otherwise.} \end{cases}$$



Several measures of quality have been proposed [12]. The most prominent measure
of quality is the frequency of the pattern w.r.t. the data set. In particular,
frequent set mining has received considerable attention [13]. In frequent set mining,
the data set $d$ is a finite sequence $d = d_1 \dots d_n$ of subsets of a set $R$, the
pattern collection $\mathcal{P}$ is the collection of subsets of $R$, and the frequency of a set
$X \subseteq R$ (w.r.t. the data set $d$) is

$$fr(X) = fr(X, d) = \frac{|\{i : X \subseteq d_i,\ 1 \leq i \leq n\}|}{n}.$$

The set $X \subseteq R$ is considered interesting if and only if $fr(X) \geq \sigma$.
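
As a small illustration of this definition, the frequency of a set can be computed by a direct scan of the data. The function names and the toy data set below are assumptions made for the example only.

```python
def frequency(itemset, data):
    """Fraction of rows d_i in the data set that contain every item of the set."""
    itemset = frozenset(itemset)
    return sum(1 for row in data if itemset <= row) / len(data)


def is_interesting(itemset, data, sigma):
    """Quality predicate: 1 if the frequency reaches the threshold, else 0."""
    return 1 if frequency(itemset, data) >= sigma else 0


data = [{"A", "B"}, {"A"}, {"B", "C"}, {"A", "B", "C"}]
print(frequency({"A", "B"}, data))            # 2 of 4 rows contain both A and B
print(is_interesting({"A", "B"}, data, 0.5))
```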

Finding a good measure of quality and an adequate threshold value is not 
easy. To avoid missing interesting patterns, very low quality threshold values 
might be needed. This implies that the number of patterns deemed to be inter- 
esting can be quite large. This is not necessarily a problem if the true objective 
of the pattern discovery was to find all the interesting patterns w.r.t. the quality 
predicate. 

Several methods have been developed for finding condensed representations 
of the pattern collections (see e.g. [14,15,1,16,17,18,19,20,21,22,23,5]). The con- 
densed representations are small descriptions of the pattern collections such that 
it is possible to infer the original collection of interesting patterns and the qual- 
ity values (approximately) using some inference method. They depend on some 
structural properties of the pattern collection and the quality measure. Usually 
condensed representations choose a subset of all interesting patterns and infer 
the quality values of the interesting patterns from that subset.

The most popular condensed representation of pattern collections is the concept
of closed patterns. The representation depends only on the partial order of
the pattern collection and the antimonotonicity of the quality measure. Let $\prec$
be a partial order for the pattern collection $\mathcal{P}$. A pattern $p \in \mathcal{P}$ is closed if and
only if its quality value is greater than the quality value of each of its superpatterns,
i.e., if and only if

$$\forall q \in \mathcal{P} : p \prec q \Rightarrow \phi(p) > \phi(q).$$

The number of closed patterns can be much smaller than the number of all 
patterns. 

Unfortunately even the condensed representations of the pattern collection 
can be very large. Thus we suggest ordering the patterns w.r.t. their informative- 
ness. From the ordered collection of patterns, the user can interactively choose 
the appropriate trade-off between the number of chosen patterns and the accu- 
racy of the approximation. 

For brevity, for the rest of the paper we shall consider frequencies instead of 
arbitrary quality measures. 

3 Pattern Ordering and Frequency Estimation 

Most condensed representations of a pattern collection consist of a subset of the
pattern collection. Thus a simple approach to simplify the condensed representation
is to order the patterns in the condensed representation so that the
next pattern increases the knowledge about the pattern collection most.

Given a collection of patterns, we are interested in finding an ordering such
that for each $i = 1, \dots, n$ the prefix $p_1, \dots, p_i$ of the ordering $p_1, \dots, p_n$ gives
as much information about $p_{i+1}, \dots, p_n$ as possible. To formulate this we need
to define estimation methods and loss functions. An estimation method $\psi$ takes
a subcollection $S$ of all patterns $\mathcal{P}$ with known frequencies, and provides
approximations of the frequencies of all patterns. I.e., an estimation method is a
function

$$\psi : \mathcal{P} \times [0,1]^{S} \to [0,1].$$

A trivial example is the estimation method which gives the known frequencies
for patterns in $S$ and 0 for everything else.

The loss function $\ell$ tells what penalty is to be paid for errors in estimating
the frequencies. The loss function takes as inputs the true frequencies of the
patterns and the estimated frequencies, and returns a score for the estimation:

$$\ell : [0,1]^{\mathcal{P}} \times [0,1]^{\mathcal{P}} \to \mathbb{R}.$$

A typical example of a loss function would be the metric
$\ell(x, y) =$

The task of ordering the patterns can be formulated as a computational 
problem as follows:

Input: A pattern collection $\mathcal{P}$, $|\mathcal{P}| = n$, a quality measure $\phi : \mathcal{P} \to [0, 1]$, an
estimation method $\psi : \mathcal{P} \times [0,1]^{S} \to [0,1]$, $S \subseteq \mathcal{P}$, and a loss function
$\ell : [0, 1]^{\mathcal{P}} \times [0, 1]^{\mathcal{P}} \to \mathbb{R}$.

Output: The pattern collection $\mathcal{P}$ as an ordered sequence $p_1, p_2, \dots, p_n$ such
that

$$\ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\{p_1, \dots, p_{i-1}, p_i\}})) \leq \ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\{p_1, \dots, p_{i-1}, p_j\}}))$$

for each $1 \leq i < j \leq n$, where $\phi|_S$ is the restriction of the mapping $\phi$ to the
set $S$.

We call this problem the pattern ordering problem. The problem can be solved 
by a greedy algorithm as follows: 

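Order-Patterns($\mathcal{P}$, $\phi$, $\psi$, $\ell$)
    $\mathcal{P}_0 = \emptyset$
    for $i \leftarrow 0$ to $n - 1$
        do $p_{i+1} \leftarrow \arg\min_p \{\ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\mathcal{P}_i \cup \{p\}})) : p \in \mathcal{P} \setminus \mathcal{P}_i\}$
           $\mathcal{P}_{i+1} \leftarrow \mathcal{P}_i \cup \{p_{i+1}\}$
    return $p_1, \dots, p_n$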
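
A direct, unoptimized rendering of the greedy algorithm might look as follows. This is a sketch under stated assumptions: patterns are frozensets, the estimation method is the maximum of known superset frequencies, the loss is the mean absolute error, and ties are broken by the sorted item list. All function names are chosen for this example and every candidate loss is recomputed from scratch.

```python
def estimate(x, known, fr):
    """Largest known frequency among x itself and its known supersets, else 0."""
    return max((fr[y] for y in known if x <= y), default=0.0)


def mean_abs_loss(fr, known):
    """Mean absolute error of the estimates over the whole pattern collection."""
    return sum(abs(fr[x] - estimate(x, known, fr)) for x in fr) / len(fr)


def order_patterns(fr):
    """Greedy ordering: repeatedly add the pattern that gives the smallest loss."""
    chosen, remaining = [], set(fr)
    while remaining:
        best = min(remaining,
                   key=lambda p: (mean_abs_loss(fr, chosen + [p]), sorted(p)))
        chosen.append(best)
        remaining.remove(best)
    return chosen


fr = {frozenset("A"): 0.75, frozenset("B"): 0.75, frozenset("AB"): 0.5}
for pattern in order_patterns(fr):
    print(sorted(pattern), fr[pattern])
```

In this toy collection the set AB comes first, because its frequency is a usable estimate for both of its subsets, whereas either singleton alone leaves the other two patterns estimated as 0.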

The running time of the algorithm depends on the efficiency of finding in
each iteration $i$ the pattern $p_{i+1}$ that decreases the error most. The time complexity
is the combined time complexity of finding the minimums. Let $M(\mathcal{P})$ be
the maximum time complexity of finding a pattern $p_{i+1}$ such that the loss for
$\mathcal{P}_i \cup \{p_{i+1}\}$ is as small as possible. Then the time complexity of the algorithm
is bounded by $O(n M(\mathcal{P}))$. For example, using the trivial estimation method
which gives the known frequencies for the chosen patterns and 0 for the others,
the minimum in each iteration can be found in logarithmic time in $n$ using a
heap [24] (assuming the loss depends only on the differences $fr(p) - \psi(p, fr|_S)$).

The ordering $p_1, \dots, p_n$ of the pattern collection $\mathcal{P}$ found by the algorithm
Order-Patterns can be interpreted as a refining approximation of the pattern
collection: each prefix $\mathcal{P}_k = \{p_1, \dots, p_k\}$ approximates the whole pattern collection
$\mathcal{P}$. The ordering might itself shed some light on the relationships between
the patterns. In addition, for several combinations of estimation methods and
loss functions it can be shown that each prefix of the ordering gives a frequency
approximation that is guaranteed to be at most a constant factor worse than
the frequency approximation from any subset of $\mathcal{P}$ of the same size.

The greedy approach, in general, offers an efficient way to find solutions
for a wide variety of problems, and several exact and approximate algorithms have
been successfully derived by this approach [25,26,27,28]. Also in the case of the
pattern ordering problem it is possible to show for certain estimation methods
and loss functions that any prefix $\mathcal{P}_k = \{p_1, \dots, p_k\}$ of the optimal solution
$p_1, \dots, p_n$ for the pattern ordering problem is at most $e/(e-1)$ worse and has
at most $(e-1)/e$ times smaller decrease in loss than any size $k$ subset $S$ of $\mathcal{P}$.
(For a more in-depth introduction to approximability, see e.g. [29].) On the other
hand, the problem of finding the $k$ patterns that describe the collection best can
be shown to be NP-hard. Thus finding the size $k$ optimal subset of $\mathcal{P}$ seems to
be infeasible for all but very small $k$.

Let us define some notation. The decrease of loss w.r.t. frequency estimation
without any known frequencies is denoted by

$$\Delta(S) = \ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\emptyset})) - \ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{S})).$$

Let $\mathcal{P}_k^*$ be a size $k$ subset of $\mathcal{P}$ with the smallest loss. The prefix of length
$k$ of the optimal solution for the pattern ordering problem is denoted by $\mathcal{P}_k$.
The problem of finding the best size $k$ subset of the pattern collection closely resembles
the minimum set cover problem, i.e., given a collection $\mathcal{T}$ of subsets of
a finite set $R$, find the smallest subset $S$ of $\mathcal{T}$ such that $\bigcup S = R$ [30]. Thus
the approximation quality of the algorithm Order-Patterns can be proven
similarly to the approximability of certain variants of the minimum set cover
problem [25].

First we prove Lemma 1 below, which can be used to show that certain
combinations of estimation methods and loss functions guarantee that
$\Delta(\mathcal{P}_k) \geq \frac{e-1}{e} \Delta(\mathcal{P}_k^*)$ holds for all $1 \leq i \leq k$. I.e., the decrease of the error in the frequency
estimation (w.r.t. the frequency estimation with no patterns) from the length $k$
prefix of the pattern ordering found by the algorithm Order-Patterns is at
least a fraction $(e-1)/e$ of the best decrease of the error over the size $k$ subsets
of $\mathcal{P}$. All one has to show is that the error decreases sufficiently in each iteration.
Lemma 1 will be used in Section 4.
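
The fraction $(e-1)/e$ can be checked numerically. The sketch below assumes the worst case, in which each iteration closes exactly a $1/k$ share of the remaining gap to the best decrease, normalized to 1; the function name is chosen for this example.

```python
import math


def greedy_fraction(k):
    """Decrease reached after k steps when each step closes 1/k of the gap to 1."""
    delta = 0.0
    for _ in range(k):
        delta = 1.0 / k + (1.0 - 1.0 / k) * delta
    return delta          # equals 1 - (1 - 1/k)**k up to rounding


bound = 1.0 - 1.0 / math.e
for k in (1, 2, 5, 50):
    print(k, greedy_fraction(k), greedy_fraction(k) >= bound)
```

The values decrease towards $1 - 1/e$ as $k$ grows but never fall below it, which is the constant appearing in the guarantee.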



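Lemma 1. If

$$\Delta(\mathcal{P}_i) - \Delta(\mathcal{P}_{i-1}) \geq \frac{\Delta(\mathcal{P}_k^*) - \Delta(\mathcal{P}_{i-1})}{k} \qquad (1)$$

holds for all $1 \leq i \leq k$ then

$$\Delta(\mathcal{P}_k) \geq \frac{e-1}{e} \Delta(\mathcal{P}_k^*)$$

holds for all $1 \leq k \leq n$.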



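Proof. From Equation 1 we get

$$\Delta(\mathcal{P}_i) \geq \frac{1}{k} \Delta(\mathcal{P}_k^*) + \left(1 - \frac{1}{k}\right) \Delta(\mathcal{P}_{i-1}).$$

Unrolling this recurrence from $\Delta(\mathcal{P}_0) = 0$ gives

$$\Delta(\mathcal{P}_k) \geq \left(1 - \left(1 - \frac{1}{k}\right)^{k}\right) \Delta(\mathcal{P}_k^*) \geq \left(1 - \frac{1}{e}\right) \Delta(\mathcal{P}_k^*)$$

as claimed. □

It is possible to show a similar result for the loss instead of the decrease of
the loss: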

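Lemma 2. If

$$\ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\mathcal{P}_{i-1}})) - \ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\mathcal{P}_i})) \geq \frac{1}{k} \left( \ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\mathcal{P}_{i-1}})) - \ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\mathcal{P}_k^*})) \right) \qquad (2)$$

holds for all $1 \leq i \leq k$ then

$$\ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\mathcal{P}_k})) \leq \frac{e}{e-1} \, \ell(\phi(\mathcal{P}), \psi(\mathcal{P}, \phi|_{\mathcal{P}_k^*}))$$

holds for all $1 \leq k \leq n$.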

Proof. The proof is essentially identical to the proof of Lemma 1. □



4 Case Study: Approximating by Maximums 
of Superpattern Frequencies 



In this section we consider approximating the frequencies of the frequent patterns
using the maximums of known superpattern frequencies. To define what
a superpattern is, we need a partial order $\prec$ for the pattern collection $\mathcal{P}$. A partial
order $\prec$ for the collection $\mathcal{P}$ is a transitive ($p \prec q \wedge q \prec r \Rightarrow p \prec r$) and irreflexive
($p \prec q \Rightarrow p \neq q$) binary relation on $\mathcal{P}$. We denote $(p, q) \in\, \prec$ by $p \prec q$. We
further assume that the partial order is antimonotone w.r.t. the frequencies, i.e.,
$p \prec q \Rightarrow fr(p) \geq fr(q)$. For example, the set inclusion relation is such a partial
order. A pattern $q$ is a superpattern of $p$ if and only if $p \prec q$. The estimation
method of maximum of superpattern frequencies is

$$\psi(p, fr|_S) = \max\left( \{fr(q) : q \in S,\ p = q\} \cup \{fr(q) : q \in S,\ p \prec q\} \cup \{0\} \right).$$



The smallest subset of frequent patterns that is sufficient to describe the
frequencies $fr(p)$ of the frequent patterns $p$ in $\mathcal{P}$ correctly is called a collection
of closed frequent patterns. More precisely, a pattern $p \in \mathcal{P}$ is closed if and only
if

$$fr(p) > \max_{q \in \mathcal{P}} \{fr(q) : p \prec q\}.$$

A closure of a pattern $p \in \mathcal{P}$, denoted by $cl(p)$, is the largest superpattern $q \in \mathcal{P}$
of $p$ such that $fr(p) = fr(q)$. The set of closures of a pattern collection $\mathcal{P}$ is
denoted by $cl(\mathcal{P})$.



Theorem 1. The collection $cl(\mathcal{P})$ of closed frequent patterns is the smallest
collection such that for all frequent patterns $p \in \mathcal{P}$ we have

$$fr(p) = \psi(p, fr|_{cl(\mathcal{P})}).$$

Proof. By definition, for each $p \in \mathcal{P}$ there is $q = cl(p) \in cl(\mathcal{P})$ such that
$fr(p) = fr(q)$. Also, no closed pattern can be left out from the collection. □

It follows from the definition of closed frequent patterns that they can be
chosen from the collection of the frequent patterns by simply checking for each
pattern whether any of its superpatterns (subpatterns) have equal frequency and
pruning the pattern (subpattern) if that holds. The efficiency of the method
depends strongly on the pattern collection, the partial order and their representations.
For example, the closed patterns from the collection $\mathcal{P}$ of frequent sets,
i.e., the closed frequent sets (or frequent closed sets), can be found in time

S (lYf-l)<W-l> = °(W'|P|) 

jce-p.x/0 ^ 

by applying the fact that $X \in cl(\mathcal{P})$ if and only if $fr(X) \neq fr(Y)$ for all
$Y \in \mathcal{P}$, $Y \supset X$, $|Y| = |X| + 1$.
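
The superset test above can be sketched in a few lines. The code assumes the collection of frequent sets is given as a dictionary from frozensets to frequencies and compares each set with every other set in the collection, so it is quadratic and meant only as an illustration; the function name is chosen for this example.

```python
def closed_frequent_sets(fr):
    """Sets whose frequency differs from that of every superset one item larger."""
    closed = set()
    for x in fr:
        supersets = (y for y in fr if x < y and len(y) == len(x) + 1)
        if all(fr[y] != fr[x] for y in supersets):
            closed.add(x)
    return closed


fr = {frozenset("A"): 0.5, frozenset("B"): 0.75, frozenset("AB"): 0.5}
print(sorted(sorted(x) for x in closed_frequent_sets(fr)))
```

Here the set A is pruned because its superset AB has the same frequency, while B and AB remain.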

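The problem turns out to be NP-hard if we allow errors. Let us first consider
the maximum of absolute errors, i.e.,

$$\ell(fr(\mathcal{P}), \psi(\mathcal{P}, fr|_S)) = \max_{X \in \mathcal{P}} |fr(X) - \psi(X, fr|_S)| = \max_{X \in \mathcal{P}} \left( fr(X) - \max_{Y \in S} \{fr(Y) : X \subseteq Y\} \right) = \max_{X \in \mathcal{P}} \min_{Y \in S} \{fr(X) - fr(Y) : X \subseteq Y\}.$$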

Then the problem is NP-hard even for the pattern class of frequent sets: 

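Theorem 2. Given a collection $\mathcal{P}$ of frequent sets and a rational number $\epsilon$, it
is NP-hard to find a smallest subset $S$ of $\mathcal{P}$ such that

$$\max_{X \in \mathcal{P}} \left( fr(X) - \max\{fr(Y) : Y \in S,\ X \subseteq Y\} \right) \leq \epsilon.$$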

Proof. We show the NP-hardness by a reduction from the decision version of
the minimum set cover problem, where the objective is, instead of finding the
smallest subset $S \subseteq \mathcal{T}$ such that $\bigcup S = R$, to decide whether there is a subset
$S \subseteq \mathcal{T}$ of size at most $k$ such that $\bigcup S = R$. We can assume that each element
in $R$ occurs in some set of $\mathcal{T}$, the cardinality of each set in $\mathcal{T}$ is greater than one,
and no set in $\mathcal{T}$ is contained in another set in $\mathcal{T}$.

Let us construct the data set $d$ of subsets of $R$ as follows: $d$ consists of $\mathcal{T}$
and an appropriate number of one-element subsets of $R$ such that $fr(\{x\}, d) =
fr(\{y\}, d)$ for all $x, y \in R$. Let $\epsilon = fr(\{x\}, d) - 1/n$, $x \in R$.

Then for each $S \subseteq \mathcal{T}$, $|S| \leq k$, the following holds:

$$\bigcup S = R \iff \max_{X \in \mathcal{P}} \left( fr(X) - \max\{fr(Y) : Y \in S,\ X \subseteq Y\} \right) \leq \epsilon.$$

□

Corollary 1. Given a pattern collection $\mathcal{P}$ and a rational number $\epsilon$, it is NP-hard
to find a smallest subset $S$ of $\mathcal{P}$ such that

On the positive side, it can be shown for the estimation method of choosing
the maximums of known superpattern frequencies that the problem of choosing
a size $k$ subset of patterns such that the maximum absolute error is minimized is
a special case of the minimum weight set cover, which is, given a collection $\mathcal{T}$ of
subsets of a finite set $R$ and a weight function $w : \mathcal{T} \to [0, 1]$, to find a subset
$S \subseteq \mathcal{T}$ of smallest weight

$$w(S) = \sum_{p \in S} w(p)$$

such that $\bigcup S = R$ [29].

If the loss function is, e.g., the average error instead of the maximum error,
the connection to set cover is not so obvious. Also in that case the approximability
guarantees can be established:

Theorem 3. For the prefix $\mathcal{P}_k$ of length $k$ of an optimal solution for the pattern
ordering problem and the size $k$ subset $\mathcal{P}_k^*$ of $\mathcal{P}$ with the smallest loss we have

$$\Delta(\mathcal{P}_k) \geq \frac{e-1}{e} \Delta(\mathcal{P}_k^*)$$

with respect to any loss function

$$\ell(fr(\mathcal{P}), \psi(\mathcal{P}, fr|_S)) = \sum_{p \in \mathcal{P}} f(|fr(p) - \psi(p, fr|_S)|)$$

where $f$ is a convex strictly increasing function.

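Proof. It suffices to show that Equation 1 holds. We have

$$\begin{aligned}
\Delta(\mathcal{P}_i) - \Delta(\mathcal{P}_{i-1})
&= \sum_{p \in \mathcal{P}} f(|fr(p) - \psi(p, fr|_{\mathcal{P}_{i-1}})|) - \sum_{p \in \mathcal{P}} f(|fr(p) - \psi(p, fr|_{\mathcal{P}_i})|) \\
&\geq \frac{1}{k} \left( \sum_{p \in \mathcal{P}} f(|fr(p) - \psi(p, fr|_{\mathcal{P}_{i-1}})|) - \sum_{p \in \mathcal{P}} f(|fr(p) - \psi(p, fr|_{\mathcal{P}_k^*})|) \right) \\
&= \frac{\Delta(\mathcal{P}_k^*) - \Delta(\mathcal{P}_{i-1})}{k}
\end{aligned}$$

because $\{p_i\} = \mathcal{P}_i \setminus \mathcal{P}_{i-1}$ is the pattern that decreases the error most and

$$\sum_{p \in \mathcal{P}} f(|fr(p) - \psi(p, fr|_S)|) = \sum_{p \in \mathcal{P}} \min\left\{ f(|fr(p) - \psi(p, fr|_{S \setminus T})|),\ f(|fr(p) - \psi(p, fr|_T)|) \right\}$$

holds for all $T \subseteq S$. □

The computation of the approximation can be made more efficient by observing
that all but the closed patterns in $\mathcal{P}$ can be neglected.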



Theorem 4. For all $\ell$ and $S \subseteq \mathcal{P}$ we have

$$\ell(fr(\mathcal{P}), \psi(\mathcal{P}, fr|_S)) = \ell(fr(\mathcal{P}), \psi(\mathcal{P}, fr|_{cl(S)})).$$

Proof. Any pattern $p \in S$ can be replaced by $cl(p)$ as $fr(p) = fr(cl(p))$, and
if $\psi(p, fr|_S) = fr(p)$ then $\psi(p, fr|_S) = fr(cl(p))$. □

5 Experiments: Approximating Frequent Sets 

We implemented the Order-Patterns algorithm to evaluate the practical usefulness
of the method. In the experiments we computed frequent sets with different
minimum frequency thresholds for two data sets from UCI KDD Repository^:
Internet Usage data set consisting of 10104 rows and 10674 attributes, and IPUMS
Census data set consisting of 88443 rows and 39954 attributes.

The estimation method was the maximum of chosen superset frequencies, i.e.,
$\psi(X, fr|_S) = \max_{Y \in S} \{fr(Y) : X \subseteq Y\}$, and the loss function was the mean of
absolute errors

$$\ell(fr(\mathcal{P}), \psi(\mathcal{P}, fr|_S)) = \frac{1}{|\mathcal{P}|} \sum_{X \in \mathcal{P}} \left| fr(X) - \max_{Y \in S} \{fr(Y) : X \subseteq Y\} \right|.$$


The results are shown in Figure 1, and in Tables 1 and 2. The results show 
that relatively short prefixes can be used to obtain a good accuracy in estimating 
the frequencies. The inversion of the order of the error curves in Figure 1 is 
due to the combination of the estimation method and the loss function: As the 
initial frequency estimates are all zero, the average absolute error is smaller for 
lower minimum frequency thresholds. On the other hand the frequencies can 
be estimated exactly from the closed frequent sets and the number of closed 
frequent sets is smaller for higher minimum frequency thresholds. 
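The prefix lengths reported in the tables come from an ordering in which each next pattern is the one that reduces the error most. The sketch below is a minimal greedy ordering in that spirit, assuming the max-of-supersets estimator and the mean absolute error from above; the function names and the toy data are hypothetical, and the authors' Order-Patterns implementation may differ in its details.

```python
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def loss(freq: Dict[Itemset, float], chosen: Dict[Itemset, float]) -> float:
    """Mean absolute error when frequencies are estimated from `chosen`."""
    total = 0.0
    for x, f in freq.items():
        est = max((g for y, g in chosen.items() if x <= y), default=0.0)
        total += abs(f - est)
    return total / len(freq)


def greedy_order(freq: Dict[Itemset, float]) -> List[Tuple[Itemset, float]]:
    """Order patterns greedily; return (pattern, error of the prefix ending there)."""
    chosen: Dict[Itemset, float] = {}
    # Sorting first makes tie-breaking deterministic.
    remaining = sorted(freq, key=lambda s: (len(s), sorted(s)))
    order = []
    while remaining:
        best = min(remaining, key=lambda p: loss(freq, {**chosen, p: freq[p]}))
        chosen[best] = freq[best]
        remaining.remove(best)
        order.append((best, loss(freq, chosen)))
    return order


# Toy collection (made-up frequencies, for illustration only).
freq = {
    frozenset({"a"}): 0.6,
    frozenset({"b"}): 0.5,
    frozenset({"a", "b"}): 0.4,
}
ordering = greedy_order(freq)
```

Each step re-evaluates the loss for every remaining candidate, so this sketch is quadratic in the number of patterns times the cost of one loss evaluation; it is meant to show the selection rule, not to be efficient.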

6 Conclusions 

We have considered the problem of ordering a pattern collection in such a way 
that each prefix of the ordered sequence of patterns would be as good a summary 
of the pattern collection as possible. A general algorithm was given, the prob- 
lem complexity and the algorithm were analyzed and the approach was justified 
experimentally. It seems that the problem of finding good orderings of pattern
collections is useful and interesting. Several open problems remain. An
especially interesting one is combining the pattern discovery and ordering
steps: could we somehow discover patterns in an order that approximates the
most informative pattern ordering? To do this exactly is impossible, but there
might be some possibilities of obtaining approximate results in the style of
competitive analysis.




^ http://kdd.ics.uci.edu 
Fig. 1. Internet Usage data (left) and IPUMS Census data (right). The axes are the
length of the prefix of the pattern ordering and the average absolute error of the
frequency estimation from the prefix. Each curve corresponds to the minimum
frequency threshold given as its label.



Table 1. Internet Usage data. The column σ corresponds to the minimum frequency
threshold. Columns |P| and |cl(P)| correspond to the cardinalities of the frequent sets
and the closed frequent sets, respectively. Each column |r(x)| corresponds to the length
of the shortest prefix found by the algorithm Order-Patterns such that the average
absolute error is at most x.

σ      |P|     |cl(P)|  |r(0.001)|  |r(0.005)|  |r(0.01)|  |r(0.02)|  |r(0.04)|  |r(0.08)|
0.17   3246    3246     2672        1925        1421       970        597        231
0.16   4013    4013     3254        2295        1671       1132       655        242
0.15   4983    4983     3994        2764        1995       1377       775        270
0.14   6291    6290     4955        3339        2362       1602       860        261
0.13   8000    7998     6208        4093        2881       1972       1034       281
0.12   10476   10472    7970        5118        3562       2414       1189       289
0.11   13813   13802    10267       6352        4305       2804       1284       264
0.10   18615   18594    13468       8068        5409       3395       1423       245
0.09   25729   25686    18035       10399       6920       4094       1587       203
0.08   36812   36714    24870       13681       9032       5008       1708       153
0.07   54793   54550    35441       18477       12147      6276       1803       95



Table 2. IPUMS Census data. The columns are as in Table 1.

σ      |P|      |cl(P)|  |r(0.001)|  |r(0.005)|  |r(0.01)|  |r(0.02)|  |r(0.04)|  |r(0.08)|
0.28   11443    1696     551         351         260        184        120        66
0.27   13843    1948     624         395         292        203        128        68
0.26   17503    2293     725         456         338        233        147        71
0.25   20023    2577     810         502         369        256        161        77
0.24   23903    3006     944         583         427        293        185        92
0.23   31791    3590     1093        661         477        328        196        85
0.22   53203    4271     1194        678         481        316        171        57
0.21   64731    5246     1454        813         573        372        189        62
0.20   86879    6689     1771        949         661        424        218        67
0.19   151909   8524     1974        953         628        363        151        27
0.18   250441   10899    2212        992         625        312        99         10
Abstract. We propose a new collaborative filtering method that uses
restoration operators. The problem of restoration by operators was originally
studied in the field of digital image restoration [9]. We also consider
the problem of selecting items that users should be asked to rate in or- 
der to achieve a small expected squared error, and we propose a greedy 
method as a solution of this problem. According to our experimental re- 
sults, prediction performance of restoration operators is good when the 
number of observed ratings is small, and our greedy method outperforms 
random query item selection. 



1 Introduction 

Information filtering has become an important technology in recent years due to
the widespread use of the Internet. This technology enables systems to learn a
user’s personal preferences and recommend items (such as news articles, music
and books) that are preferred by the user. Collaborative filtering, an
information filtering technique, has been studied extensively in recent years
[10,11,2,8]. This filtering technique enables a system to recommend items to a
user that are preferred by similar users. Generally, a user’s preference for an
item is represented by a rating, and both the calculation of similarities
between users and the judgement on whether similar users prefer an item are
based on the ratings. The collaborative filtering problem can also be considered
as the problem of predicting unknown entry values from known entry values of a
partially known user-item rating matrix.

The performance of a system in giving initial recommendations for a new
user is important because a user will not continue to use the system if it takes
a long time for the system to learn the user’s preference. In this paper, we
propose a collaborative filtering method using restoration operators that can
make good predictions based on a small number of ratings. Furthermore, we
consider the problem of selecting items that a new user is asked to rate in
order to achieve optimal prediction performance. This kind of learning, in which
a learner actively obtains the information needed to learn, is called active
learning [4] and it is one of the current hot research topics.

There have been a few studies on collaborative filtering using active learning. 
Goldberg et al. [7] developed a system that asks new users to rate the same set of 
items called the gauge set. The system uses the ratings for the items in the gauge
set to classify the user into one of the clusters and then recommends the items
that are popular among the users belonging to that cluster. However, Goldberg et
al. did not describe how the gauge set is selected. Boutilier and Zemel [3] proposed
a system that asks new users to rate an item with the maximum expected value
of information with respect to the current probabilistic model. Dasgupta et al.
[5] considered that there are a number of typical users and that each user’s
ratings are close to those of one typical user. They proposed a query selection
algorithm and analyzed the number of queries to an arbitrary user needed for
finding a typical user with similar ratings. Here, a query means asking a user
to rate a given item. Query selection, therefore, means determining which
item is asked about.

Restoration operators have been extensively studied in the field of digital
image restoration [1,9]. For a known degradation operator P, researchers in that
field have studied the problem of how to restore a digital image $x \in \mathbb{R}^m$
from a given degraded image $Px \in \mathbb{R}^k$ ($k < m$). The collaborative filtering
problem can also be seen as a restoration problem in which a vector x of a user’s
preference for all items is restored from a partial vector Px of x, where P is an
operator that restricts the components to those having observed values.
Considering the characteristics of the collaborative filtering problem, we propose
a collaborative filtering method using an unbiased Wiener filter without additive
noise. Our experimental results showed that our method outperforms the best
performing correlation-based method when the number of observed ratings is small.

In our problem setting, the active learning problem, the problem of selecting
the best items that a new user is asked to rate, is that of finding a restriction
operator P that minimizes the expected squared error. However, to the best of
our knowledge, there are no efficient methods to find the best operator P among
$O(m^k)$ operators, where m is the number of items and k is the number of queries.
Instead of finding the best operator, we propose a greedy method that finds an
operator with small expected error by greedy search which involves comparison
of $O(km)$ operators. Our experimental results showed that our greedy method
is better than random selection in terms of prediction performance.

This paper is organized as follows. In Section 2, we formalize a collaborative 
filtering problem as a restoration problem and show its solution. In Section 3, 
we describe an optimal query item selection problem and our greedy selection 
method. Results of experiments using the EachMovie data set are shown in 
Section 4, and future work is discussed in Section 5. 



2 Collaborative Filtering Problem 
as a Restoration Problem 

Let m denote the number of items and $x \in \mathbb{R}^m$ denote a vector of the user’s
preference for all items. Assume that x is generated according to an arbitrary
distribution over $\mathbb{R}^m$. The collaborative filtering task can be regarded as a
restoration of vector x from a given partial vector of x. Such a restoration
problem has been studied in the area of digital image restoration. In this
section, we formalize a collaborative filtering problem like the formalization
of a digital image restoration problem presented in [9].

For $J = \{i_1, i_2, \ldots, i_k\} \subseteq \{1, 2, \ldots, m\}$, let $P_J$ denote a restriction
transformation (matrix) that restricts components of a vector to those in J,
namely, $P_J$ is a transformation such that $P_J x = (x_{i_1}, x_{i_2}, \ldots, x_{i_k})^T$
for $x = (x_1, x_2, \ldots, x_m)^T$.

Observing $P_J x$ for some J, we want to find an estimation $\hat{x}$ of x which is close
to x, that is, the mean squared error (MSE) $E_x\|x - \hat{x}\|^2$ is as small as
possible, where $\|\cdot\|$ denotes the Euclidean norm. Since $\hat{x}$ is calculated by
applying a restoration transformation to $P_J x$, this problem is reduced to that of
finding an optimal transformation. For fixed J, this problem can be formalized as
follows if restoration transformations are restricted to affine transformations.

Problem 1. For given J with^ |J| = k, find $(B_{opt}, w_{opt})$ satisfying

$$(B_{opt}, w_{opt}) = \arg\min_{(B,w)} E_x \|x - B P_J x - w\|^2, \quad (1)$$

where the minimization is with respect to all possible linear transformations
$B : \mathbb{R}^k \to \mathbb{R}^m$ and all possible vectors $w \in \mathbb{R}^m$.

This problem setting can be regarded as that of an unbiased Wiener filter
without additive noise. (See [9].) The solution for this problem is as follows. Note
that $^T$ and tr(·) denote the transpose and trace of a matrix, respectively.

Solution 1.

$$B_{opt} = R P_J^T (P_J R P_J^T)^+,$$

$$w_{opt} = (I - B_{opt} P_J) E_x x,$$

$$\min_{(B,w)} E_x \|x - B P_J x - w\|^2 = \mathrm{tr}(R - B_{opt} P_J R),$$

where $R = E_x (x - E_x x)(x - E_x x)^T$ and $(P_J R P_J^T)^+$ is the Moore-Penrose inverse
of $P_J R P_J^T$.
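The closed form of Solution 1 translates directly into matrix code. The following is a minimal sketch assuming NumPy, 0-based item indices, and a covariance matrix and mean vector that are already known; the function names and the 2-item toy covariance are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np


def restriction_matrix(J, m):
    """k x m matrix P_J that keeps the components listed in J (0-based)."""
    P = np.zeros((len(J), m))
    P[np.arange(len(J)), J] = 1.0
    return P


def wiener_restoration(R, mean, J):
    """Return (B_opt, w_opt, minimal expected squared error) for item set J."""
    m = R.shape[0]
    P = restriction_matrix(J, m)
    B = R @ P.T @ np.linalg.pinv(P @ R @ P.T)   # B_opt = R P^T (P R P^T)^+
    w = (np.eye(m) - B @ P) @ mean              # w_opt = (I - B_opt P) E[x]
    mse = float(np.trace(R - B @ P @ R))        # tr(R - B_opt P R)
    return B, w, mse


# Toy example (made-up numbers): two correlated items, only item 0 observed.
R = np.array([[1.0, 0.8],
              [0.8, 1.0]])
mean = np.array([0.5, 0.5])
B, w, mse = wiener_restoration(R, mean, [0])

observed = np.array([1.0])        # the user's rating for item 0
restored = B @ observed + w       # estimate of the full preference vector
```

In this toy case the observed component is reproduced exactly and the unobserved one is shifted from its mean in proportion to the covariance between the two items.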

Why do we not take noise into account? In digital image restoration, we
assume that an original image x contains no noise and that as training data we
observe Px + n, an image which is degraded by a degradation operator P and
to which noise n is added, as well as x. If we take noise into account in our
collaborative filtering problem, it should be assumed that we observe a user’s
preference x that contains noise and a partial vector $P_J x$ that does not contain
additional noise. Assume that the observed vector x contains noise n. Then,
what we want to estimate is $\hat{x}$ that minimizes $\|x - n - \hat{x}\|^2$. Thus, instead of
Equation (1), we should use the equation

$$(B_{opt}, w_{opt}) = \arg\min_{(B,w)} E_x E_n \|x - n - B P_J x - w\|^2. \quad (2)$$

If we assume that x and n are mutually independent and that E(n) = 0,

$$E_x E_n \|x - n - B P_J x - w\|^2 = E_x \|x - B P_J x - w\|^2 + E_n \|n\|^2.$$

^ Here, | J| denotes the number of elements in J. 
Since the term $E_n\|n\|^2$ does not affect the minimization, Equation (2) coincides
with Equation (1). Therefore, our problem setting deals with the case in which
noise is contained in a user’s preference, which can be inevitable because user’s
evaluation values are quantized.

When Solution 1 is used, consideration must be given to the method used for
estimating the covariance matrix $R = E_x (x - E_x x)(x - E_x x)^T$. What makes
this difficult is that this estimation must be done from the user’s preference
dataset X, the elements of which have many missing values. Here, we estimate
R as follows. Let $n_i$ denote the number of elements in X whose ith component
is not missing. Let $\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m)$ denote an estimation of $E_x x$. Then, in
our estimation,

$$\bar{x}_i = \sum_{x \in X,\, x_i \neq *} x_i / n_i, \quad (3)$$

where * denotes the missing value. For $x \in X$, let $x'$ denote the vector that is
made from x by replacing all missing values $x_i$ with $\bar{x}_i$. Our estimation $\hat{r}_{i,j}$ of
the (i,j) entry value of R is calculated as

$$\hat{r}_{i,j} = \frac{1}{n} \sum_{x \in X} (x'_i - \bar{x}_i)(x'_j - \bar{x}_j),$$

where n is the number of elements in X.
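The estimation procedure above (item means over observed ratings, mean imputation, then an empirical covariance) can be sketched as follows. This is an illustration under stated assumptions: NumPy is used, missing values are encoded as NaN, every item has at least one observed rating, and the covariance is normalized by n as in the text. The function name and the small rating matrix are hypothetical.

```python
import numpy as np


def estimate_mean_and_cov(X):
    """Estimate the item-mean vector (Eq. 3) and the covariance matrix R.

    X is an (n users) x (m items) array with np.nan marking missing ratings.
    Assumes every item has at least one observed rating.
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    n_i = observed.sum(axis=0)                          # ratings per item
    mean = np.where(observed, X, 0.0).sum(axis=0) / n_i
    X_filled = np.where(observed, X, mean)              # impute with item mean
    centered = X_filled - mean
    R = centered.T @ centered / X.shape[0]
    return mean, R


# Toy rating matrix (made-up numbers): 3 users, 2 items, one missing rating.
X = np.array([[1.0, np.nan],
              [0.0, 0.4],
              [0.5, 0.8]])
mean, R = estimate_mean_and_cov(X)
```

An imputed entry contributes zero to every covariance term it takes part in, which is why a large share of missing values pulls the estimated covariances toward zero.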

3 Optimal Query Item Selection 

Another concern is for which items we should know the user’s preference in order 
to estimate the whole vector of the user’s preference as precisely as possible. The 
problem of optimal query item selection is that of finding the best J, which is 
fixed in Problem 1. 

Problem 2. For given $1 \le k \le m$, find $J_{opt} \subseteq \{1, 2, \ldots, m\}$ satisfying

$$J_{opt} = \arg\min_{|J| = k} \min_{(B,w)} E_x \|x - B P_J x - w\|^2, \quad (4)$$

where the second minimization is with respect to all possible linear
transformations $B : \mathbb{R}^k \to \mathbb{R}^m$ and all possible vectors $w \in \mathbb{R}^m$.

By plugging Solution 1 into Equation (4), we obtain the following equation:

$$J_{opt} = \arg\max_{|J| = k} \mathrm{tr}(R P_J^T (P_J R P_J^T)^+ P_J R). \quad (5)$$

Finding $J_{opt}$ requires calculation of $\mathrm{tr}(R P_J^T (P_J R P_J^T)^+ P_J R)$ for
$m(m-1) \cdots (m-k+1)/k!$ combinations J of items, so it is not practical for a large
value of m. In that case, we use a greedy method by which optimal items are
added to J one by one until the number of items in J becomes k. Our greedy
method uses $J_k$ instead of $J_{opt}$. $J_k$ is defined as follows:

$$J_k = \arg\max \{ \mathrm{tr}(R P_J^T (P_J R P_J^T)^+ P_J R) : J \supseteq J_{k-1},\ |J| = k \}, \quad J_0 = \emptyset. \quad (6)$$

Note that the number of combinations J in calculation of $J_k$ is $O(km)$, while
that in calculation of $J_{opt}$ is $O(m^k)$.
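The greedy rule of Equation (6) can be sketched as a loop that adds, at each step, the item giving the largest trace criterion. The code below is a minimal illustration assuming NumPy, 0-based item indices and a symmetric covariance matrix; the function names and the 3-item toy covariance are hypothetical.

```python
import numpy as np


def explained_variance(R, J):
    """tr(R P_J^T (P_J R P_J^T)^+ P_J R), written with index selections."""
    J = list(J)
    R_cols = R[:, J]                 # R P_J^T        (m x k)
    R_rows = R[J, :]                 # P_J R          (k x m)
    R_sub = R[np.ix_(J, J)]          # P_J R P_J^T    (k x k)
    return float(np.trace(R_cols @ np.linalg.pinv(R_sub) @ R_rows))


def greedy_query_items(R, k):
    """Select k items one by one, each time maximizing the trace criterion."""
    m = R.shape[0]
    J = []
    for _ in range(k):
        candidates = [j for j in range(m) if j not in J]
        best = max(candidates, key=lambda j: explained_variance(R, J + [j]))
        J.append(best)
    return J


# Toy covariance (made-up numbers): items 0 and 1 are strongly correlated,
# item 2 is independent of both.
R = np.array([[1.2, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 0.5]])
selected = greedy_query_items(R, 2)
```

In the toy case the second query goes to the independent item rather than to the item that is already well predicted by the first query, which is the behaviour one would hope for from the criterion.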
4 Experiments

We conducted two experiments: one for evaluation of prediction performance 
based on a small number of rated items, and the other for evaluation of query 
item selection by our greedy method. 

4.1 Correlation-Based Methods 

We compared the performance of our method to the performances of
correlation-based methods [10, 11]. In correlation-based methods, predictions are
made using correlation coefficients between each pair of users that are
calculated from ratings for the commonly rated items. The correlation coefficient
between users x and y is calculated as follows:

$$w_{x,y} = \frac{\sum_j (x_j - c_{x,j})(y_j - c_{y,j})}{\sqrt{\sum_j (x_j - c_{x,j})^2 \sum_j (y_j - c_{y,j})^2}},$$

where the sums run over the items j with $x_j \neq *$ and $y_j \neq *$, and $c_{x,j}$ is a center
value for user^ x and item j. Predictions of $x_i$ for user x with $x_i = *$ are made
using the correlation coefficients $w_{x,y}$ between user x and other users y with
$y_i \neq *$ as follows:

$$\hat{x}_i = c_{x,i} + \frac{\sum_{y : y_i \neq *} w_{x,y} (y_i - c_{y,i})}{\sum_{y : y_i \neq *} |w_{x,y}|}.$$

With respect to $c_{x,i}$, we consider the following three variations.

1. Fixed center

In this variation, $c_{x,i}$ is the same constant for all x. The method using this
correlation coefficient is called the constrained Pearson r algorithm in [11].

2. User mean

In this variation, $c_{x,i}$ is the average of $x_j$ over all items j with $x_j \neq *$. The
method using this correlation coefficient is called the Pearson r algorithm in [11].
In the case of new users having no ratings, we use the average of all scores
rated by all users.

3. Item mean

In this variation, $c_{x,i}$ is the average of $x_i$ over all users x with $x_i \neq *$, namely,
$c_{x,i} = \bar{x}_i$, where $\bar{x}$ is defined in Equation (3).
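A correlation-based predictor of this kind can be sketched as follows. The sketch assumes the commonly used form of such methods: a centered correlation over commonly rated items, and a prediction equal to the center plus a weighted average of the other users' deviations, normalized by the sum of absolute weights. That normalization, the function names, the dictionary representation of ratings and the toy users are all assumptions made for illustration, and the fixed center of 0.5 is just one possible constant.

```python
import math
from typing import Callable, Dict, List

Ratings = Dict[str, float]                 # item -> rating; missing items are absent
Center = Callable[[Ratings, str], float]   # (user ratings, item) -> center value


def correlation(x: Ratings, y: Ratings, center: Center) -> float:
    """Centered correlation over the items rated by both users."""
    common = [j for j in x if j in y]
    num = sum((x[j] - center(x, j)) * (y[j] - center(y, j)) for j in common)
    den = math.sqrt(
        sum((x[j] - center(x, j)) ** 2 for j in common)
        * sum((y[j] - center(y, j)) ** 2 for j in common)
    )
    return num / den if den > 0 else 0.0


def predict(x: Ratings, others: List[Ratings], item: str, center: Center) -> float:
    """Predict the rating of `item` for user x from the users who rated it."""
    num = den = 0.0
    for y in others:
        if item not in y:
            continue
        w = correlation(x, y, center)
        num += w * (y[item] - center(y, item))
        den += abs(w)
    base = center(x, item)
    return base if den == 0 else base + num / den


def fixed_center(_ratings: Ratings, _item: str) -> float:
    return 0.5   # hypothetical constant center for a 0..1 rating scale


# Toy users (made-up ratings).
x = {"a": 1.0, "b": 0.0}
y1 = {"a": 1.0, "b": 0.0, "c": 1.0}    # agrees with x
y2 = {"a": 0.0, "b": 1.0, "c": 0.2}    # disagrees with x
prediction = predict(x, [y1, y2], "c", fixed_center)
```

Swapping `fixed_center` for a function that returns the user's mean rating or the item's mean rating gives the other two variations.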

^ We are somewhat loose in our notation here, using a vector of a user’s preference
as if it were the user itself.
4.2 Data

In our experiment, we used the EachMovie collaborative filtering data set [6]. 
The data set consists of 2,811,983 numeric ratings for 1,628 movies evaluated 
by 72,916 users. The numeric rating for a pair of user and movie represents how 
much the user likes the movie on a six-point scale (0.0, 0.2, 0.4, 0.6, 0.8, 1.0). 
Note that only 2.37% of the user-movie matrix is filled. 



Table 1. Statistical information on the data used in our experiments

#Users   #Movies   Filled %
2,000    1,618     17.1

#Ratings by rating value
0.0              0.2             0.4              0.6               0.8               1.0              total
93,980 (17.0%)   32,895 (5.9%)   73,277 (13.2%)   138,337 (24.9%)   135,406 (24.4%)   80,950 (14.6%)   554,845

                         max    min   average
#(Ratings by a user)     1455   191   277.4
#(Ratings for an item)   1796   1     342.9



In our experiments, we used the ratings of the 2,000 users with the largest
number of ratings. Statistical information on our data is shown in Table 1. Note
that 17.1% of the user-movie matrix of the data is filled.



4.3 Performance Measures 



We evaluate the performance by two measures: mean squared error and recall- 
precision curve. The former is suitable to our problem setting, but the latter is 
more appropriate from a practical point of view. The following is an explanation 
of the method used for drawing recall-precision curves. 

Rating values are divided into two classes, hot (0.8 or 1.0) and cold (0.0, 0.2,
0.4 and 0.6), as in [2]. Precision and recall are used to evaluate to what degree a
method correctly predicts a set of hot movies for each user. Let T denote the set
of hot movies for an arbitrary user and S denote the set of movies predicted to
be hot for that user. Then, precision and recall for this prediction are defined as
follows:

$$\text{precision} = \frac{|T \cap S|}{|S|}, \qquad \text{recall} = \frac{|T \cap S|}{|T|}.$$



For methods in which prediction values are given by real numbers, the set of 
movies predicted to be hot can be obtained by determining a threshold and 
selecting movies whose predicted values are greater than the threshold. In such 
cases, a recall-precision curve can be drawn by moving the threshold. We want to 
know the performance averaged over all users, but the problem is the difference 
in user’s thresholds for the same recall. Thus, for recall r and user u, we find the 
set Sr,u of movies predicted to be hot by lowering the threshold until recall r is 
achieved, and we calculate averaged precision for recall r as follows: 






averaged precision for recall r 
where $T_u$ is the set of hot movies for user u and U is the set of users. We draw
the recall-precision curve by moving recall r.
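The per-user step of this procedure, lowering the threshold until a target recall is reached and then measuring precision, can be sketched as follows. The sketch covers a single user only; how the per-user values are combined into the averaged precision is not shown here. The function names and the toy scores are hypothetical, and ties between predicted values are broken arbitrarily by the sort.

```python
import math
from typing import Dict, Set


def predicted_hot_at_recall(scores: Dict[str, float], hot: Set[str], r: float) -> Set[str]:
    """Walk down the ranking (i.e. lower the threshold) until recall >= r."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    needed = math.ceil(r * len(hot))      # number of hot movies to retrieve
    selected, hits = [], 0
    for movie in ranked:
        if hits >= needed:
            break
        selected.append(movie)
        hits += 1 if movie in hot else 0
    return set(selected)


def precision(selected: Set[str], hot: Set[str]) -> float:
    """|T ∩ S| / |S|; defined as 1.0 for an empty selection (an assumption)."""
    return len(selected & hot) / len(selected) if selected else 1.0


# Toy user (made-up predicted values); movies "a" and "c" are truly hot.
scores = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.2}
hot = {"a", "c"}
s_half = predicted_hot_at_recall(scores, hot, 0.5)
s_full = predicted_hot_at_recall(scores, hot, 1.0)
```

Reaching full recall here forces the cold movie "b" into the selection, so precision drops from 1 to 2/3.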

4.4 Experimental Methodology 

We randomly divided the 2,000 users into 10 groups and carried out a 10-fold
cross validation. We first estimated the covariance matrix R using only the
training data. Then, for each user in the test data, we selected a set of movies
for queries from the movies whose scores are known. Based on the ratings for the
selected movies, predictions for the other movies whose scores are known were
made. We evaluated methods by prediction performance averaged over all users. In
the case of random query item selection, prediction performance was also
averaged over five runs.



4.5 Results 

Prediction Performance Based on a Small Number of Rated Items. 

We compared prediction performance of restoration operators with prediction 
performances of correlation-based methods when a small number of items are 
selected randomly as a set of items that are assumed to be rated. 




Fig. 1. Learning curves (horizontal axis: number of rated movies)



Fig. 1 shows the relation between the number of rated movies and MSE. 
Here, predictions by restoration operators using no rating are those by the es- 
timated mean vector x defined by Equation (3). Among the correlation-based 
methods, the method using the item mean performed best. As can be seen in 
Fig. 1, prediction performance of restoration operators was better than predic- 
tion performances of correlation-based methods. 

Fig. 2 shows recall-precision curves when the number of rated movies is
five and ten. The results are consistent with the results in terms of MSE. The
prediction performances of the two correlation-based methods using the fixed
center and the user mean could not even exceed the prediction performance of the
mean vector. The method using restoration operators slightly outperformed the
best correlation-based method, the one using the item mean, at low recalls, which
are the more important parts because recommender systems only present
high-ranking items.

Fig. 2. Recall-precision curves (Left: prediction using 5 rated movies, Right: prediction
using 10 rated movies). Curves: restoration operator, correlation for fixed center,
correlation for item mean, correlation for user mean, mean vector. Horizontal axis: recall.

Fig. 3. Learning curves (horizontal axis: number of queries)



Effectiveness of Query Item Selection. We investigated the effectiveness of 
our greedy query item selection. 

Fig. 3 shows the relation between the number of queries and MSE. As can
be seen in the figure, MSE decreases as the number of queries increases. We
compared the prediction performance of our greedy method with that of random
selection. The graph shows that the greedy method outperformed random selection.

The left panel of Fig. 4 shows recall-precision curves for predictions made
by the greedy method and the mean vector. It can be seen that the performance
improves as the number of queries increases, particularly at low recalls.
Improvement in performance at low recalls with respect to the number of queries
is shown in the right panel of Fig. 4. The rate of improvement decreases when the
number of queries is larger than three.

Fig. 4. Left: Recall-precision curves, Right: #Queries-precision curves

Fig. 5 shows a comparison of the performances of our greedy method and
random selection in terms of recall-precision curves. The graph shows that the
greedy method also outperformed random selection with respect to this measure.

Fig. 5. Comparison with random selection

Are the query items selected by our greedy method effective for the
predictions of correlation-based methods? According to the results of our
experiments, this is true for the correlation-based method using the item mean.
(See Fig. 6.)

Fig. 6. Left: Learning curves. Right: Recall-precision curves for predictions using 5
queries

5 Future Work

The following issues must be considered in future work.

1. Better estimation of the covariance matrix R

In order to estimate R from a user’s preference dataset X, we must deal with
the problem of missing values. In our experiments, we replaced all missing
values with the item mean, but this does not seem to be the best method,
especially in the case of a large number of missing values. It is possible to
estimate the (i,j) entry value $r_{ij}$ of R from only those vectors x whose ith
and jth components are not missing, but the resultant matrix R loses the
property of non-negative definiteness, which seems to have a bad influence on
the solution.

2. Efficient query item selection method whose solution is closer to 
the optimal one 

According to our experimental results, the performance of our greedy method 
is only slightly better than that of random selection. The development of a 
fast algorithm whose solution is closer to the optimal one is therefore needed. 

3. Online query item selection 

The methods proposed in [3, 5] are online, that is, the next query item is 
decided on the basis of answers to the previous queries. Development of such 
methods in our framework is preferable because there is a possibility that a 
set of query items more appropriate for the predictions of a user’s preference 
can be selected by using information obtained by the previous queries. 
Abstract. The upgrade of frequent item set mining to a setup with mul-
tiple relations - frequent query mining - poses many efficiency problems. 
Taking Object Identity as starting point, we present several optimization 
techniques for frequent query mining algorithms. The resulting algorithm 
has a better performance than a previous ILP algorithm and competes 
with more specialized graph mining algorithms in performance. 



1 Introduction 

Recently, multi-relational or structured data mining has gained much interest. 
Especially frequent structure mining similar to Apriori [1] was discussed in a 
number of recent publications, such as the gSpan algorithm by Yan et al. [10] and 
FSG by Kuramochi et al. [7]. Given a database of complex structures — in the 
case of gSpan and FSG a collection of graphs — the task of these algorithms is to 
find those substructures that occur in many of the complex structures. Already 
several years ago, Dehaspe et al. [3] introduced an algorithm called Warmr for 
frequent pattern mining in relational databases. Warmr was built on the solid 
theoretical foundations of Inductive Logic Programming (ILP). It accomplished 
similar tasks as the more recent algorithms. When comparing Warmr to graph 
mining algorithms such as gSpan, we note the following points: 

– the greater expressiveness of Warmr: specialized mining algorithms often
concentrate on one type of database, for example databases of labeled undi- 
rected graphs. For different kinds of structures, modified algorithms are re- 
quired. In ILP algorithms, such as Warmr, any structure can be expressed 
easily. The incorporation of background knowledge is also straightforward. 

— the choice for traditional clause based query evaluation in Warmr: in this 
case, two variables may have the same value during evaluation. In the sub- 
graph mining algorithms two nodes in a subgraph cannot be mapped to one 
node in a database graph. 

— in publications of subgraph mining algorithms [6,7,10], much attention is 
given to efficiency issues. The Warmr algorithm can be considered as a 
proof-of-concept of a framework; efficiency issues have not been given too 
much attention. 

In this paper, we will introduce a new algorithm for frequent query mining.
While it is largely comparable to Warmr in expressiveness, it uses techniques
introduced by subgraph mining algorithms as well as new techniques. The main
contributions of the paper are:

— We show how the query discovery task can be changed to query discov- 
ery under Object Identity. Although the focus of this paper is not on the 
semantic consequences of this choice, we will argue that this approach
closely matches that of subgraph mining, is very natural and does not pose 
restrictions in many data mining situations. 

— Building upon this evaluation under Object Identity, and using a tree data 
structure, we will define an order on queries that allows for more efficient 
search space traversals than the approach used by Warmr. To some extent 
this order is equivalent to that of gSpan; it is however more flexible and 
allows for some new optimizations. 

— We will show how this order can be exploited in both breadth-first and 
depth-first algorithms. For the latter case we will introduce optimizations 
that are allowed by the query ordering, including hash structures and sorting 
to reduce the cost of query evaluations that result in false. 

— We will present experimental results showing large speed-ups in comparison 
with a recent implementation of Warmr. We will also compare our results 
to those obtained by gSpan and FSG. In some cases, our algorithm obtains 
similar run times as FSG, but it does not equal the efficiency of gSpan. We 
will give some arguments for this difference in performance. 

Our aim is to use ILP formalisms that are very close to Warmr and to reach 
the efficiency of algorithms like gSpan and FSG. 

Our depth-first and breadth-first algorithms are major revisions of our previ- 
ous Farmer algorithm for mining multiple relations [8] . The algorithm in [8] was 
restricted to some variants of labeled, unordered trees and did not use Object 
Identity for query evaluation. In the breadth-first algorithm presented here, only 
the tree-like notation is reused. Restrictions that were present in the previous 
version of Farmer, do no longer exist in our new algorithm. We will however 
still use the name Farmer to denote our class of algorithms. 

2 Search Space Specification and Object Identity 

We will introduce some notation. Any capital A denotes an atom. An atom set
S is an unordered set of atoms. An ordered atom set is called a query and is
denoted by a capital Q. With (Q, A) we denote the query Q to which atom A
is concatenated. With last(Q) we denote the last atom of Q. The variables in
A = last(Q) that do not occur in Q \ A are called the new variables of A in Q.

Every predicate p is considered to be typed: each argument has a type. A
variable or constant that is used as an argument of a predicate has the same
type as the argument. Types are frequently used in ILP systems to allow the
definition of narrower search spaces. With var(S, T) (or var(Q, T)) we denote
the set of all variables of type T in an atom set S (or a query Q).

We will first introduce a mechanism that defines the search space of queries
that our algorithm will investigate. It uses a mode mechanism similar to that of
Warmr.
Definition 1 (Bias). A mode declaration $p(c_1, \ldots, c_n)$ consists of a predicate
with arguments $c_i$, each of which is either ‘+’ (input), ‘-’ (output) or ‘#’
(constant). The bias B of a search space consists of: 1) the type definitions of
predicates, 2) a set of modes $\mathcal{M}$, 3) an operator const(T) which defines for each
type T a set of constants, 4) a function max which assigns an integer to each
predicate in $\mathcal{M}$, and 5) one atom k(X) (this atom is called the key of the search).

Definition 2 (Search space). Given a bias B, a query Q and an atom A =
$p(t_1, \ldots, t_n)$, atom A is a (mode) refinement of Q iff there is a mode M =
$p(c_1, \ldots, c_n) \in \mathcal{M}$ such that for every $1 \le i \le n$ either:

– $t_i$ is a variable in $var(Q, T_i)$ and $c_i$ = ‘+’, or

– $t_i$ is a variable not in $\bigcup_j var(Q, T_j)$ and $c_i$ = ‘-’, or

– $t_i$ is a constant in $const(T_i)$ and $c_i$ = ‘#’.

Here, $T_i$ is the type of argument position i. The search space S(B) defined by
a bias B consists of all queries Q that can be built iteratively starting from the
key atom k(X) using valid refinements. Each atom A that is added to a query Q
should satisfy the following restrictions to be a valid refinement: 1) A is a mode
refinement; 2) A does not already occur literally in Q; 3) the predicate p used in
A does not occur more than max(p) times in the new query.
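The refinement rules of Definition 2 can be made concrete with a small generator of valid refinements. This is a sketch under stated assumptions rather than the Farmer implementation: atoms are `(predicate, args)` tuples, variables are strings starting with an upper-case letter, constants start with a lower-case letter, each ‘-’ position receives its own fresh variable, and all names (`refinements`, `fresh_names`, the dictionaries describing the bias) are hypothetical.

```python
from itertools import product


def is_var(term):
    return term[0].isupper()


def vars_of_type(query, types, wanted):
    """var(Q, T): variables that occur in Q at an argument position of type T."""
    found = set()
    for pred, args in query:
        for arg, ty in zip(args, types[pred]):
            if ty == wanted and is_var(arg):
                found.add(arg)
    return sorted(found)


def fresh_names(used):
    """Yield variable names N1, N2, ... that do not occur in `used`."""
    n = 1
    while True:
        name = f"N{n}"
        if name not in used:
            yield name
        n += 1


def refinements(query, types, modes, consts, max_occ):
    """All valid refinements of `query` under the given bias."""
    used = {a for _, args in query for a in args if is_var(a)}
    out = []
    for pred, mode in modes:
        # Restriction 3: at most max(p) occurrences of p in the new query.
        if sum(1 for p, _ in query if p == pred) >= max_occ[pred]:
            continue
        choices = []
        for symbol, ty in zip(mode, types[pred]):
            if symbol == "+":
                choices.append(vars_of_type(query, types, ty))
            elif symbol == "#":
                choices.append(list(consts[ty]))
            else:                       # '-': placeholder for a new variable
                choices.append([None])
        for combo in product(*choices):
            names = fresh_names(used)
            atom = (pred, tuple(next(names) if a is None else a for a in combo))
            if atom not in query:       # Restriction 2: not literally in Q
                out.append(atom)
    return out


# Bias for the edge predicate e(G, N, N, L) with key k(G).
types = {"k": ("G",), "e": ("G", "N", "N", "L")}
consts = {"L": ["a", "b"]}
key_query = [("k", ("G",))]
first_level = refinements(key_query, types, [("e", ("+", "-", "-", "#"))],
                          consts, {"e": 4})
```

Starting from the key atom alone, only the mode with two output positions can fire, because no node variable exists yet to fill a ‘+’ position of type N.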



Fig. 1. Directed, edge labeled graphs (panels G1, G2, G3 and G4).



Example 1. As an example we will use the representation of a directed, edge
labeled graph using a predicate e(G, N, N, L). Graph G1 in Fig. 1 can be
represented using the following facts:

$K = \{k(g_1), e(g_1, n_1, n_2, a), e(g_1, n_2, n_1, a), e(g_1, n_2, n_3, a), e(g_1, n_3, n_1, b),$
$\quad e(g_1, n_3, n_4, b), e(g_1, n_3, n_5, c)\}.$

Of course, the choice of constants $n_i$ is arbitrary here. Using the set of modes
$\mathcal{M}$ = {e(+,-,-,#), e(+,+,-,#), e(+,+,+,#)} the following queries can be
constructed:

$Q_2 = k(G), e(G, N_1, N_2, a), e(G, N_2, N_3, a), e(G, N_1, N_4, a), e(G, N_4, N_5, b),$
$Q_3 = k(G), e(G, N_1, N_2, b), e(G, N_2, N_3, a), e(G, N_3, N_2, a), e(G, N_3, N_4, a),$
$Q_4 = k(G), e(G, N_1, N_2, b), e(G, N_2, N_3, a), e(G, N_3, N_2, a);$

they correspond to graphs G2, G3 and G4 in Fig. 1.

Given a knowledge base $K$ the support of a query $Q$ can be defined as:

$support_K(Q) = \#\{\theta \mid K \models Q\theta\}$,

where $\theta$ is a substitution to constants of all variables in the key of
$Q$; $Q\theta$ denotes the application of this substitution to $Q$.

In Warmr, to compute the $\models$ relation, a Prolog engine based on
$\theta$-subsumption is used. For a knowledge base containing only facts, this
evaluation comes down to the discovery of a substitution $\theta$ such that
$Q\theta \subseteq K$. On the other hand, we will use an evaluation technique
based on subsumption under Object Identity (OI-subsumption). Under Object
Identity, the satisfying substitution $\theta$ is constrained in two ways: no
two variables in $Q$ may be mapped to the same constant, and no variable may be
mapped to a constant already occurring in $Q$.

We will briefly illustrate some consequences of this choice. Under usual
$\theta$-subsumption, example query $Q_2$ is a consequence of the knowledge base
$K$, as $Q_2$ can be satisfied by mapping $N_1 \mapsto n_2$, $N_3 \mapsto n_2$,
$N_2 \mapsto n_1$, $N_5 \mapsto n_1$, $N_4 \mapsto n_3$.

In the graph notation of Fig. 1, some nodes in G2 are mapped to the same nodes
in G1. Under Object Identity, this is not allowed: the mapping must be
injective. Such an injective mapping is also used in gSpan for labeled
undirected graphs. Similar arguments show that G3 is not included in G1 under
Object Identity, while it would be included under traditional
$\theta$-subsumption.
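
The difference between the two evaluation schemes can be reproduced on Example 1
with a small backtracking matcher. The sketch below is our own illustration, not
the evaluation engine of Warmr or Farmer: atoms are encoded as
(predicate, arguments) pairs, variables are assumed to start with an uppercase
letter, and the flag `injective` switches Object Identity on.

```python
def is_var(term):
    # Assumed convention: variables start with an uppercase letter.
    return term[0].isupper()

def satisfiable(atoms, facts, theta, query_consts, injective):
    """Backtracking search for an extension of theta that maps every
    remaining atom onto a fact of the knowledge base."""
    if not atoms:
        return True
    (pred, args), rest = atoms[0], atoms[1:]
    for fpred, fargs in facts:
        if fpred != pred or len(fargs) != len(args):
            continue
        new = dict(theta)
        for a, c in zip(args, fargs):
            if not is_var(a):
                if a != c:
                    break
            elif a in new:
                if new[a] != c:
                    break
            elif injective and (c in new.values() or c in query_consts):
                break  # Object Identity: constant c is already taken
            else:
                new[a] = c
        else:  # every argument matched; continue with the next atom
            if satisfiable(rest, facts, new, query_consts, injective):
                return True
    return False

def support(query, facts, injective):
    """Number of key substitutions theta with K |= Q theta (query[0] is the key)."""
    key_pred, key_args = query[0]
    query_consts = {t for _, args in query for t in args if not is_var(t)}
    count = 0
    for fpred, fargs in facts:
        if fpred == key_pred:
            theta = dict(zip(key_args, fargs))
            if satisfiable(query[1:], facts, theta, query_consts, injective):
                count += 1
    return count

def e(*args):
    return ("e", args)

K = [("k", ("g1",)),
     e("g1", "n1", "n2", "a"), e("g1", "n2", "n1", "a"), e("g1", "n2", "n3", "a"),
     e("g1", "n3", "n1", "b"), e("g1", "n3", "n4", "b"), e("g1", "n3", "n5", "c")]
Q2 = [("k", ("G",)), e("G", "N1", "N2", "a"), e("G", "N2", "N3", "a"),
      e("G", "N1", "N4", "a"), e("G", "N4", "N5", "b")]
Q3 = [("k", ("G",)), e("G", "N1", "N2", "b"), e("G", "N2", "N3", "a"),
      e("G", "N3", "N2", "a"), e("G", "N3", "N4", "a")]
Q4 = Q3[:-1]

for name, q in (("Q2", Q2), ("Q3", Q3), ("Q4", Q4)):
    print(name, support(q, K, injective=False), support(q, K, injective=True))
```

On these facts, Q2 and Q3 are satisfied only without the injectivity
constraint, while Q4 is satisfied in both settings.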

An important issue is that of query equivalency. In general, two queries $Q_1$
and $Q_2$ are equivalent iff for every possible knowledge base $K$:
$K \models Q_1 \Leftrightarrow K \models Q_2$. For evaluation without Object
Identity, one can prove that $Q_1$ and $Q_2$ can only be equivalent when $Q_1$
and $Q_2$ mutually subsume each other. Without OI, G3 and G4 in our example are
equivalent. Every graph which contains G4 also contains G3, as node $N_4$ can
always be mapped to the same node as $N_2$. The first reason for choosing Object
Identity is that these counterintuitive situations are prevented under OI. Under
OI queries are equivalent iff they are alphabetic variants [4,5]. We will define
this equivalency relation more precisely. Given a query $Q$, let $vars(Q)$
denote the set of all variables occurring in $Q$ and let $varlf(Q)$ denote the
list of all variables in $Q$ in order of first occurrence.

Definition 3 (Equivalency of queries). Given a query $Q$, the normally named
query $n(Q)$ is the query $Q$ to which the following renaming substitution is
applied: $\theta = \{V/V_i \mid V \in vars(Q), i = ord(V,Q)\}$. Here $ord(V,Q)$
is the position of $V$ in $varlf(Q)$. Two queries $Q_1$ and $Q_2$ are equivalent
(denoted by $Q_1 \equiv Q_2$) if there exists a permutation $\pi$ of the atoms
in $Q_1$ such that $n(\pi(Q_1)) = n(Q_2)$.

To determine whether two queries are equivalent is therefore 'only' a problem
of finding a permutation which transforms the one query into the other. This is
still a difficult problem; it can be shown that to compute whether two queries
are equivalent, one has to solve a graph isomorphism problem, and vice versa.
The complexity of graph isomorphism is currently unknown: no polynomial
algorithm is known, and a proof of NP-completeness does not exist either. In
comparison with full $\theta$-subsumption, however, OI makes the computation of
equivalency slightly easier. This property is the second reason for choosing OI.
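
Definition 3 translates directly into a brute-force test. The following sketch
is an illustration under our own encoding (atoms as (predicate, arguments)
pairs, variables starting with an uppercase letter); it tries all permutations
and is therefore factorial in the query length, which is exactly the cost that
the tree-based normal form of Section 3 is meant to avoid.

```python
from itertools import permutations

def is_var(term):
    # Assumed convention: variables start with an uppercase letter.
    return term[0].isupper()

def normalize(query):
    """n(Q): rename variables to V1, V2, ... in order of first occurrence."""
    rename = {}
    out = []
    for pred, args in query:
        new_args = []
        for a in args:
            if is_var(a):
                if a not in rename:
                    rename[a] = f"V{len(rename) + 1}"
                new_args.append(rename[a])
            else:
                new_args.append(a)
        out.append((pred, tuple(new_args)))
    return tuple(out)

def equivalent(q1, q2):
    """True iff some permutation pi of q1 satisfies n(pi(q1)) == n(q2)."""
    if len(q1) != len(q2):
        return False
    target = normalize(q2)
    return any(normalize(p) == target for p in permutations(q1))

q1 = [("k", ("X",)), ("e", ("X", "A", "B", "a")), ("e", ("X", "B", "C", "b"))]
q2 = [("k", ("G",)), ("e", ("G", "N2", "N3", "b")), ("e", ("G", "N1", "N2", "a"))]
q3 = [("k", ("G",)), ("e", ("G", "A", "B", "a")), ("e", ("G", "B", "A", "a"))]
q4 = [("k", ("G",)), ("e", ("G", "A", "B", "a")), ("e", ("G", "A", "C", "a"))]
print(equivalent(q1, q2), equivalent(q3, q4))
```

Here q1 and q2 describe the same two-edge path with the atoms in a different
order, whereas q3 (a two-node cycle) and q4 (two edges leaving one node) are not
alphabetic variants of each other.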

We will now present our pattern mining task.

Definition 4. Given a bias $B$, a knowledge base $K$ and a threshold $minsup$,
Farmer should discover a set of queries $\mathcal{Q}$ such that for every
$Q \in S(B)$ with $support_K(Q) \ge minsup$, there is exactly one
$Q' \in \mathcal{Q}$ such that $Q' \equiv Q$.

A query for which $support_K(Q) \ge minsup$ is said to be frequent. The single
query in $\mathcal{Q}$ to which a query $Q$ is equivalent is considered to be
its normal form or its canonical label.

The third advantage of OI can be understood by considering $Q_2$ in conjunction
with the following modes, which define a search space of edge labeled trees:
$\{e(+,-,-,\#), e(+,+,-,\#)\}$. Query $Q_2$ is not equivalent with any smaller
query. Every subquery $Q_2' \in S(B)$ of $Q_2$ with $|Q_2| = |Q_2'| + 1$ is
however equivalent with a query of size smaller than $|Q_2'|$. An algorithm
which relies on refinement with building blocks of one atom will not construct
$Q_2$ if it removes equivalent queries immediately. Such difficulties with
refinement are avoided under OI.

The choice for Object Identity has many consequences on the types of pat- 
terns that can be discovered. As an illustration consider a situation in which one 
also allows wildcards as labels. A possible query in this case would be: 



Under full OI, all labels $L_1$, $L_2$ and $L_3$ must be different. Although for
clear objects (such as nodes), an inequality constraint is a natural choice, for
properties (such as the label of an edge) inequality can be undesirable. An
elegant solution could be to use a variant of Object Identity which does not
force OI on variables for such properties; in this weaker OI, one can sometimes
(and also in this example) still guarantee the three properties of Object
Identity that we exploit. Due to lack of space, we refer to [9] for more details
about OI related issues.





[Figure omitted: a query tree containing the atoms e(V1,V2,V3,a) and
e(V1,V2,V3,b); child nodes are labeled 1., 2. or 3., and two of them carry a *.]

Fig. 2. A query tree.



3 A Tree Based Normal Form 



In our algorithm, all queries are stored in an ordered tree as given in Fig. 2.
Every node in this tree is labeled with an atom. Every path starting in the root
represents a query. Every node has therefore an associated query. Once a query
is counted, its support is stored in the associated node. The query tree is
similar to the query pack tree used by Warmr for efficient query evaluation [2].
By introducing an order on nodes in the tree, Farmer adds as main application
of the tree the efficient determination of candidate queries. The order of
queries is determined by their order in the tree:

Definition 5 (Order of queries). Let $Q_1 = (Q_p, A_p, A_1, Q_1')$ and
$Q_2 = (Q_p, A_p, A_2, Q_2')$, $A_1 \ne A_2$, be two queries (where $Q_p$,
$Q_1'$ and $Q_2'$ may be empty), then $Q_1 <_T Q_2$ iff $A_1 < A_2$ in the child
list of $(Q_p, A_p)$.

If $Q_1 <_T Q_2$, then $Q_1$ is called an earlier query than $Q_2$ or $Q_2$ is
called a later query than $Q_1$.
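
A minimal data structure for such a query tree, together with the order of
Definition 5, could look as follows. The class `Node` and the function `earlier`
are hypothetical names chosen for this sketch; both nodes are assumed to belong
to the same tree, and the order is left undefined (here: False) when one query
is a prefix of the other.

```python
class Node:
    """One atom in the query tree; the path from the root spells a query."""

    def __init__(self, atom, parent=None):
        self.atom = atom
        self.parent = parent
        self.children = []    # ordered child list
        self.support = None   # filled in once the query has been counted
        self.marked = False   # kept only as a building block (Algorithm 2, line 6)

    def add_child(self, atom):
        child = Node(atom, self)
        self.children.append(child)
        return child

    def path(self):
        node, nodes = self, []
        while node is not None:
            nodes.append(node)
            node = node.parent
        return nodes[::-1]

    def query(self):
        return [n.atom for n in self.path()]


def earlier(n1, n2):
    """Q1 <_T Q2: at the first position where the two paths differ, the atom
    of Q1 comes before the atom of Q2 in the child list of their parent."""
    for a, b in zip(n1.path(), n2.path()):
        if a is not b:
            siblings = a.parent.children
            return siblings.index(a) < siblings.index(b)
    return False  # identical queries, or one is a prefix of the other


root = Node("k(V1)")
a = root.add_child("e(V1,V2,V3,a)")
b = root.add_child("e(V1,V2,V3,b)")
ab = a.add_child("e(V1,V3,V4,b)")
print(ab.query(), earlier(ab, b), earlier(b, ab))
```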

An outline of the Farmer algorithm is given in Algorithm 1 and 2. In line (5)
of Algorithm 1, the order in which nodes are expanded is intentionally left
unspecified. The order is only restricted by the precondition of Farmer-Expand.
In line (5) of Farmer, and line (3) of Algorithm 2 this observation is used:

$\forall Q_2: (\exists Q_1 \subseteq Q_2: support(Q_1) < minsup) \Rightarrow support(Q_2) < minsup$.



Algorithm 1: Farmer

Input: A bias B, a knowledge base K and a threshold minsup.
Output: A tree T with all queries according to Definition 4.

(1) Read K and determine const(T) for each type T
(2) T := a tree with only the key atom in the root
(3) repeat
(4)     Count the frequency of all uncounted queries.
(5)     for one or more uncounted, unmarked, frequent leaves do
(6)         Expand that leaf
(7) until T contains no uncounted queries
(8) Remove all marked nodes

Algorithm 2: Farmer-Expand

Input: A query Q in a tree T with counts for (1) all ancestor queries of Q;
(2) all earlier queries Q', $|Q'| \le |Q|$; (3) all later queries Q' which are a
brother of an ancestor of Q.
Output: A query tree with uncounted expansions Q' of Q, $|Q'| = |Q| + 1$.

(1) Let A be last(Q); let $A_p$ be the parent of A and $Q_p$ the query
    associated with $A_p$.
(2) Add as child of A all valid refinements A'' = last(n(Q, A')), where
    A' is either:
(3)   1. a frequent atom occurring after A in $A_p$'s child list, where new
         variables in A' are renamed such that they are also new in (Q, A').
(4)   2. a dependent atom, which is any atom that uses at least one variable
         that was new in A.
(5)   3. a copy of A if A has new variables; those new variables are given
         new names in the copy.
(6) Remove the new child query A'' if it is equivalent to an earlier query,
    unless the child is only equivalent to a brother. In that case A'' is
    marked but kept in the tree.




356 Siegfried Nijssen and Joost N. Kok 



It is this property that has led to the popularity of Apriori-like algorithms:
this property restricts the search space in such a way that it is possible to
compute all frequent patterns if the threshold is not too low.

We will first consider the resulting tree T when all queries are frequent and 
Farmer-Expand line (6) is absent. We will show that for every query in the 
search space, at least one equivalent query can be found in this tree. 

Example 2. Under these assumptions, and given modes $e(+,-,-,\#)$ and
$e(+,+,-,\#)$, Fig. 2 shows for each query of length 2 how it is obtained from
queries of length 1 by applying Farmer-Expand. Each number indicates which of
the three possibilities is applied to generate a new atom.

Lemma 1. Given is a query $Q$ which occurs in a query tree $T$ generated by
Farmer, and an atom $A \notin Q$ which is a valid refinement of $Q$. Then a
query $Q' = n(Q_1, A, Q_2)$ exists in the tree $T$, for some subdivision of $Q$
into $Q_1$ and $Q_2$, $Q = (Q_1, Q_2)$. Furthermore, $Q$ is either a prefix of
$Q'$ or $Q' <_T Q$.

Proof. As $A$ is a valid refinement of $Q$, there is a prefix $(Q_p, A_p)$ of
$Q$ such that the normalized atom $A' = last(n(Q_p, A_p, A))$ is a dependent
atom of $A_p$. This dependent atom is generated in line 4 of the Farmer-Expand
algorithm. If $A_p$ is the last atom of $Q$, our statement is clear. Therefore
assume that $A_p$ has a different successor $A_{p+1}$ in $Q$. This atom
$A_{p+1}$ is also a child of $A_p$ in $T$. Consider the order of $A'$ and
$A_{p+1}$ in the list of children:

— if $A'$ occurs before $A_{p+1}$, $A_{p+1}$ is a right-hand child of $A'$. The
copying mechanism in line 3 will copy $A_{p+1}$ as a child of $A'$; all steps
which created $Q$ are applicable subsequently and result in a query $Q'$.

— if $A'$ equals $A_{p+1}$, both have output variables. In line 5 a
self-duplicate $A_{p+1}$ of $A'$ is generated. All steps which created $Q$ are
applicable subsequently.

— if $A'$ occurs after $A_{p+1}$, $A'$ is copied as a child of $A_{p+1}$. This
child of $A_{p+1}$ may be left or right from $A_{p+2}$ (the next atom in the
original query). We can recursively apply our arguments on the situation for
$p + 1$ until one of the above conditions holds.

Also the order of the old and new query follows from these arguments. □

Theorem 1 (Completeness of search). For every query $Q_1$ in the search space,
there is at least one equivalent query $Q_2$ in the tree $T$.

Proof (Sketch). This can be shown by induction on the length of the query. A
query with only the key occurs in $T$. By inductive assumption, an equivalent
query for $Q_1 \setminus last(Q_1)$ exists in the tree, and a corresponding
variable renaming. When $last(Q_1)$ is renamed accordingly, this renamed atom is
a valid refinement of the equivalent query, and one can apply Lemma 1. □

Two equivalent queries that still coexist without line 6 in Algorithm 2 are
indicated with a (*) in Fig. 2. We will now consider the algorithm with this
line added. We have to prove that by removing an atom from the tree, we do not
remove an atom that otherwise would have been used to create a query for which
no equivalent query exists.

Lemma 2. Let $T$ be the tree obtained after iterative application of
Algorithm 2 without line 6. Assume that a query $Q_2$ is equivalent with a query
$Q_1 <_T Q_2$. Then every query $Q_2'$ which has $Q_2$ as prefix must have an
equivalent query more to the left in the tree.

Proof. As $Q_2$ is equivalent with $Q_1$, there is a permutation of atoms of
$Q_2$ followed by a renaming $\theta$ that makes $Q_2$ equal to $Q_1$. This
substitution $\theta$ can be applied to all atoms in the remainder
$Q_2' \setminus Q_2$. Some of these atoms are now valid refinements of $Q_1$.
According to Lemma 1 one by one these atoms can be added to $Q_1$, yielding
queries $Q_1'$ that are either extensions of $Q_1$ or occur earlier,
$Q_1' <_T Q_1$. □

Theorem 2. For every query defined by the bias, Algorithm 1 generates exactly 
one normal form if all queries are frequent. 

Proof. It is clear that no two normal forms can occur: in line 6 of Algorithm 2
and line 8 of Algorithm 1, any query which has an equivalent lower query is
removed. Theorem 1 showed that if equivalents were not removed, the search is
complete. According to Lemma 2, if a query $Q$ is equivalent to an earlier
query, all of its descendants must also be equivalent to an earlier query. $Q$
should therefore not be expanded further. The only remaining function of atom
$last(Q)$ is its function as an expansion for earlier brothers in line 3 of
Algorithm 2. In case $Q$ is equivalent to an earlier query $Q'$ which is not a
brother, $last(Q)$ is not required as a building block for earlier brothers: the
brother atom can be added to $Q$ to yield a query $Q'' < Q'$ and every expansion
of $Q'$ can also be added to $Q''$ (similar to the construction of Lemma 2). By
the marking mechanism only those atoms are kept as building blocks that are
equivalent to an earlier brother. □

A consequence of the monotonicity constraint is that every building block of 
a query Q must also be frequent. From this observation it follows that our 
algorithm performs exactly the task that was defined in Definition 4. 

4 Depth First and Breadth First Algorithms 

In the algorithm discussed in the previous section, many elements have been 
kept unspecified. In this section, we give an overview of some details. 

Equivalency Check. To determine whether an earlier equivalent query exists,
we essentially use an exhaustive search algorithm. Given a query Q, the mode
mechanism is used recursively to build queries Q' that contain atoms in Q. After
an atom is added, the tree T is consulted to determine whether Q' is later (in
which case Q' is not further expanded) or infrequent (in which case Q cannot
be frequent and is pruned). Once a query Q' < Q is found which contains all
atoms of Q, Q is pruned. Especially the combination of frequency pruning with
equivalency pruning is a distinctive feature of our algorithm. Although it
requires infrequent nodes to be stored in the tree, it could give the exhaustive
exponential search an additional value and could significantly reduce the number
of queries that have to be counted later.

Order of Query Expansion. We distinguish two query expansion orders: breadth 
first and depth first. In the breadth first approach, all nodes at the lowest level 
of the tree are expanded. This yields a tree in which all nodes at the new lowest 
level are uncounted. The nodes are counted next, and the process is repeated 
until no new level can be added. 

In the depth first approach, only one node is expanded; the new children 
are counted immediately. Starting with the first child, the process is recursively 
repeated. Only after the complete subtree of the first child has been constructed, 
the next child is recursively expanded. 

In both approaches, the precondition of Algorithm 2 is satisfied. Breadth first 
is the traditional approach and corresponds to the evaluation order of Apriori 
[1], Warmr [3] and FSG [6]. The depth first order matches that of gSpan. 

Query Counting. To determine whether a query is OI-subsumed by a knowledge
base of facts, an exponential search is required (one can easily see that this
problem is equivalent to the subgraph isomorphism problem, which is known to
be NP-hard). Especially those queries which cannot be satisfied for a given key
substitution are computationally very expensive, as many variable assignments
have to be checked before this can be concluded. The task of the algorithm is to
reduce the number of key substitutions which result in false as much as
possible, and to reduce the cost of such an evaluation if the computation is
required.

One strategy to reduce the computational cost, is to overlap the computation 
of queries. Consider a query Q with several child expansions. One can backtrack 
over all possible assignments of Q as long as one of the child expansions is not 
satisfied. This is more efficient than to evaluate each child expansion separately. 

The advantage of the breadth-first approach is that the number of queries 
that should be evaluated at a certain level is maximal. For a given substitution 
of key variables, the evaluation of many queries can be combined. Our breadth 
first implementation uses this evaluation technique, which is similar to query 
packs as discussed in [2] for Warmr and [8] for our previous Farmer algorithm. 

To reduce the number of false evaluations, a substitution ID list approach 
can be used. For each query that is evaluated, one can store the list of all key 
substitutions for which the query can be satisfied. One can easily see that a query 
which is constructed from a query Q (either by copying last{Q) or by expanding 
Q) can never be true for key substitutions for which Q is false. Therefore only 
substitutions in Q's SID list need to be evaluated. 
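
The effect of substitution ID lists can be shown on a toy example. The sketch
below is a deliberately simplified stand-in for query evaluation: each key
substitution is a graph identifier, each "graph" is reduced to its set of edge
labels, and a "query" is a set of required labels. These simplifications and all
names (`count_with_sid`, `graphs`, `holds`) are assumptions of the example; only
the filtering idea is taken from the text.

```python
def count_with_sid(parent_sids, holds):
    """Evaluate a child query only for the key substitutions in the parent's
    SID list; the returned list is the child's SID list, and its length is
    the child's support."""
    return [sid for sid in parent_sids if holds(sid)]

# Toy knowledge base: graph id -> set of edge labels occurring in the graph.
graphs = {"g1": {"a", "b", "c"}, "g2": {"a", "b"}, "g3": {"c"}}
evaluations = []  # records every (expensive) evaluation that is performed

def holds(required):
    def check(sid):
        evaluations.append(sid)
        return required <= graphs[sid]
    return check

sid_a = count_with_sid(list(graphs), holds({"a"}))          # parent query
sid_ab = count_with_sid(sid_a, holds({"a", "b"}))           # expansion of the parent
sid_abc = count_with_sid(sid_ab, holds({"a", "b", "c"}))    # expansion of the expansion
print(sid_a, sid_ab, sid_abc, len(evaluations))
```

Graph g3 is evaluated only once, for the parent query; none of the expansions
is ever tried on it.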

To reduce the cost of evaluation, with each key substitution $\theta$ one can
also store the variable assignments that satisfy each query Q. If the
backtracking over variables is performed in a deterministic order from left to
right, one can continue the evaluation of each expansion of Q starting from the
assignment that satisfied Q without having to recompute that assignment. Some
assignments are skipped in this way, but one can show that this can safely be
done. To reduce the memory demand of the approach, for each query Q and key
substitution $\theta$ we only store the difference $\Delta(\theta, Q)$ between
the first variable assignment that satisfies Q and the first assignment that
satisfies the parent in T of Q.

Order of Children. There are many possible child orders: 

— The order in which children are generated in Algorithm 2. This is the order 
that we used in [8] and yields queries that are very well readable. 

— A lexicographical order. To determine query equivalency, one repeatedly has 
to search for a given atom in a set of children. With a lexicographical order, 
in combination with binary search and hashing, we speed up this search. 

— Sorted by support. Atoms with a lower support occur earlier in a query in 
this case, which results in a quicker evaluation of queries that cannot be 
satisfied (the most selective atoms occur earliest). 

— Sorted by backtracking progression. Consider a query Q, a key substitution 
$\theta$ and a set $\Delta(\theta, Q)$ of variable assignment changes. The
position of the leftmost variable affected by $\Delta(\theta, Q)$ in Q is the
backtracking progression of Q for $\theta$. By averaging $\Delta(\theta, Q)$
over all $\theta$ one can compute the average backtracking progression of each
query. When a candidate query (Q, A) is generated by copying an atom A below a
query Q, both $\Delta(\theta, Q)$ and $\Delta(\theta, A)$ could be used as
starting point for the evaluation of (Q, A); best would be to always use the
assignment which has backtracked most. However, when the evaluation of several
queries is overlapped, much additional bookkeeping would be required. As a
tradeoff we always use $\Delta(\theta, Q)$ as starting point, but sort to make
sure that the parent has backtracked most on average.

Note that in the last two orders, some special care has to be taken in the equiv- 
alency procedure, as the order of children is only known after they are counted. 

5 Experimental Results 

From the possibilities discussed in the previous section, we implemented and 
tested several (see [9]). We implemented a breadth-first algorithm with naive 
sorting order and evaluation without substitution ID lists as a reference algo- 
rithm. Furthermore we implemented a depth-first algorithm which incorporated 
overlapping evaluation and a complex sorting order: given a query Q that is 
going to be expanded, all children of nodes that are not an ancestor of Q are 
stored in lexicographical order to allow for quick equivalency checks; nodes on 
the path corresponding to Q are also sorted first on backtracking cost, then on 
support and finally lexicographically. These two orders can be combined in an
efficient way. From our experiments, we concluded that this combination is most
beneficial.

Bongard Dataset¹. The Bongard dataset [2] was used to compare Warmr,
depth-first and breadth-first Farmer (Fig. 3). In the experiments, Farmer was
clearly several orders of magnitude faster than Warmr. One should however
realize that in these experiments, Warmr was provided with a bias that forced
Object Identity by adding inequality atoms. Warmr was not optimized for this.
Part of the efficiency difference may also be due to the different programming
language that was used (Prolog).

¹ Experiments were performed on a Linux Pentium II 350MHz with 192MB RAM,
using the GNU C++ compiler, version 2.96 with the -O3 code optimization setting.

Fig. 3. Results on the Bongard dataset. Default dataset size is 392, minsup = 5%.



Predictive Toxicology Evaluation Challenge (PTE). Execution times for 
PTE were published in [10], [6] and [7]; in these publications, labeled, undirected 
graphs were constructed from the atom and bond information; one searches 
for connected frequent subgraphs. To emulate the injective setup of gSpan and 
FSG, Object Identity is a necessity. To deal efficiently with connected, undirected 
graphs, the mode mechanism that was described in this article is not powerful 
enough. Therefore, we incorporated a more powerful declarative formalism based 
on mode trees in Farmer. Due to space limitations, we omit the details. 

Table 1 and Fig. 4 display some execution times. We also show some execution
times from other publications to put these into perspective. Note that our
algorithm runs on computers with relatively little memory, even though the ID
lists augmented with variable assignments have to be stored in main memory.



Table 1. Comparison of execution times on the PTE dataset for
$minsup \in \{6\%, 7\%\}$. A dash marks a value that is not reported.

Machine                           Algorithm                        6% (s)   7% (s)
Intel Pentium III 500MHz 448MB    gSpan [10]                       5        -
AMD Dual Athlon MP1800+ 2GB       FSG Iterative Partitioning [7]   11       7
AMD Athlon XP1600+ 265MB          Farmer                           72       48
Intel Pentium II 350MHz 192MB     Farmer                           224      148
Intel Pentium III 500MHz 448MB    FSG [10]                         248      -
AMD Dual Athlon MP1800+ 2GB       FSG Inverted index [7]           675      23
Intel Pentium III 650MHz 2GB      FSG [6]                          -        600

We may conclude that our algorithm does not reach the state-of-the-art
performance of gSpan. Compared to other graph mining algorithms, its performance
is reasonable. We could easily compute all frequent subgraphs down to a support
of 3%. The performance of gSpan is hard to obtain with the more general setup
that we are dealing with. For example, gSpan orders the labels on the vertices
first and performs a depth-first search to discover the graphs with the first
label first. Next, all vertices with the first label are removed from the
database, and the process is repeated for the remaining graphs. In general, this
optimization is harder to apply. Therefore, one is better off using gSpan if one
is searching for exactly the kind of patterns that gSpan is optimized for.

Fig. 4. Results on the PTE dataset. Farmer was run on an AMD Athlon XP1600+.
For FSG, the results published in [7] for an AMD Dual Athlon MP1800+ are used.

Mutagenesis. The Mutagenesis dataset is very similar to the PTE dataset and 
was also used in [2]. We use it to compare Farmer with Warmr without Object 
Identity. Using a minimum support of 20%, Warmr discovers 91 frequent queries 
in 207s (of which 205s are spent while generating candidates). On the same Intel 
Pentium II Farmer discovers 1075 frequent queries in 73s. The different number
of queries is due to the fact that Warmr does not discover graphs like C-C-C,
as these are equivalent to C-C without Object Identity. The set of queries found
by Farmer is a proper superset of those found by Warmr.



6 Conclusion 

In this article we presented an efficient algorithm for discovering frequent queries. 
We used Object Identity and a tree data structure to introduce several optimiza- 
tions. Experiments showed that the algorithm outperforms Warmr and is com- 
parable with some more specialized algorithms, but is not as efficient as recently 
published graph mining algorithms.

Abstract. We consider the problem of identifying a user typing on a
computer keyboard based on patterns in the time series consisting of 
keyboard events. We develop a learning algorithm, which can rather ac- 
curately learn to authenticate and protect users. Our solution is based 
on a simple extension of the well known Lempel-Ziv (78) universal com- 
pression algorithm. A novel application of our results is a second-layer 
behaviometric security system, which continually examines the current 
user without interfering with this user’s work while attempting to iden- 
tify unauthorized users pretending to be the user. We study the utility of 
our methods over a real dataset consisting of 5 users and 30 ‘attackers’. 



1 Introduction 

Many security systems rely on a single log-on entry, typically a password, for 
access. Such systems can be compromised if the password is discovered, or is 
easy to attack. Greater security is achieved by relying on a physical means of 
identification, most often an access card (which may also include a One-Time- 
Pad to generate secure passwords). But if the card is lost it too could become 
a security risk. In general, all of these systems are vulnerable to an attacker 
co-opting a user’s session; either by physically taking the place of the user, 
or by some exploitable weakness in the system. Recently, biometric methods 
have begun to appear in widespread use (see e.g. [1]). These typically rely on 
fingerprints or retinal structure. While some biometric security methods are 
considered rather safe, by and large these systems are only used for single log- 
on, and require additional hardware. 

A different class of identification methods can rely on patterns appearing in
a user’s behavior when interacting with a machine. Possible examples could be
driving a car, or interacting with a computer through typing, mouse control,
navigation patterns, and so on. Such behaviometric identification is different
from biometric identification in two respects¹. On the one hand, behaviometric
measurements can be intentionally biased (or corrupted) to some extent by
users who can control their behavior. On the other hand, unlike biometric
measurements, behaviometric readings can be done on a continuous basis without
interrupting or interfering with users’ activities. This possibility allows for
creating a secondary security system, which is continually operated after
log-on. Such a system need not only be applicable to computers but to a wider
array of other devices as well; conceivably any device with a sufficiently
complex input system.

¹ The field dealing with measurements, theories and analysis of patterns in all
aspects of human behavior is called behaviometrics.

The aim of this study is to examine the question of whether a behaviometric 
security method can be automatically learned by a machine. In particular, we 
focus on the problem of typist identification. While being a particular instance of 
behavior, we believe that typing can represent some essential and general issues 
in behaviometric identification. As with other types of interaction with
machines, it is suggested that every person types differently, owing not only to
typing method (e.g. touch typing) but, more importantly, to the person’s
physical and mental attributes. The size of one’s hands, length of one’s
fingers, fine motor
skills, language skills, and knowledge of keyboard layout could all come into 
play to affect how one types. Thus, identifying a typist is an interesting and 
challenging problem worthy of behavior analysis. 

Our solution to the learning of typist classifiers is based on a number of sim- 
ple ideas, which combine well into an effective method. We represent sequences 
of typing events as discrete sequences over finite (and rather small) alphabets, 
and then use universal prediction machines (based on known universal compres- 
sion algorithms) to generate probabilistic behaviometric models for users. Using 
these models we then solve instances of single-class classification problems. We 
describe the method and evaluate its performance over a real dataset collected 
from various typists. Our examination provides a proof of concept indicating 
that automatic learning of behaviometric identification of typists is a feasible 
task. 



2 Problem Setup and Preliminaries 

With a security application in view (as mentioned above), we model the typist 
identification problem as a single-class classification problem where we have a 
training set of typing samples from one user u and we would like to construct 
a classifier capable of distinguishing new typing sequences generated by u from 
sequences generated by other users. Usually, in this single-class setting the other 
samples (not from u) are referred to as outliers. 

In general, a single-class classification formulation is required whenever it is 
possible to acquire training examples of the target class (e.g. typing sequences 
of the user u) but hard or impossible to collect examples of the outliers (e.g. 
sequences of intruders). Thus, while the desired classifier is still binary and should 
discriminate between the target and the outliers, only one side of the boundary is 
supported by the data. Therefore single-class classification problems are harder 
(and much less studied) than standard binary classification problems (see also 
Section 6). The performance of a single-class classifier is best measured using
standard statistical distinctions between error of the first type $\delta_1$,
giving the proportion of target samples which are classified as outliers, and
error of the second type $\delta_2$, measuring the proportion of outlier samples
classified as target samples. A plausible requirement, in the context of
security systems, is that the tradeoff between $\delta_1$ and $\delta_2$ is
controlled by the user.
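
For concreteness, the two error types can be computed from labeled test
decisions as in the following sketch. The function name `error_rates` and the
boolean encoding are our own choices for this illustration, and the sample
decisions are made up.

```python
def error_rates(is_target, accepted):
    """Return (delta1, delta2).

    is_target[i] -- True if sample i was really produced by the user u
    accepted[i]  -- True if the classifier labeled sample i as target
    delta1 is the fraction of target samples rejected as outliers,
    delta2 is the fraction of outlier samples accepted as target.
    """
    targets = [a for t, a in zip(is_target, accepted) if t]
    outliers = [a for t, a in zip(is_target, accepted) if not t]
    delta1 = sum(1 for a in targets if not a) / len(targets)
    delta2 = sum(1 for a in outliers if a) / len(outliers)
    return delta1, delta2

# Made-up decisions: 4 samples of the user, 5 samples of attackers.
is_target = [True, True, True, True, False, False, False, False, False]
accepted = [True, True, True, False, True, False, False, False, False]
print(error_rates(is_target, accepted))
```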

We now characterize more formally the typing sequences we consider. The 
output generated when a user operates a keyboard is a sequence of events which 
can be described as follows. Standard keyboards usually have 104 keys and the 
keyboard outputs events when keys are pressed and released. Let $K$ be the set
of keys on the keyboard (that is, $|K| = 104$). Each key can be in one of two
states: pressed or released. Let $A = \{press, release\}$ be this set of states.
A keyboard event, $e = (k, a)$, where $k \in K$, $a \in A$, occurs whenever a
key is pressed or released. Let $E$ be the set of all keyboard events. Clearly,
$|E| = |K||A| = 2|K| = 208$. A sequence $e_1, e_2, \ldots, e_n$ of keyboard
events is viewed as a time series $X_1, X_2, \ldots, X_n$ where
$X_i = (e_i, t_i)$ and $t_i$ is the time recorded for the event $e_i$. Any such
finite time series of keyboard events is called a sentence.
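
These definitions map onto very simple data structures. In the sketch below the
key names (`key0` to `key103`) and the timestamps are placeholders invented for
the example; only the sizes of the sets follow from the text.

```python
from itertools import product

KEYS = [f"key{i}" for i in range(104)]   # placeholder names for the 104 keys
STATES = ("press", "release")            # the set A
EVENTS = set(product(KEYS, STATES))      # E = K x A

# A sentence is a finite time series of (event, time) pairs; the times
# (in milliseconds) are made up.
sentence = [
    (("key30", "press"), 0),
    (("key30", "release"), 85),
    (("key31", "press"), 210),
]
print(len(EVENTS), all(event in EVENTS for event, _ in sentence))
```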

3 Typist Identification via Universal Prediction 

The proposed solution to typist identification is based on universal prediction al- 
gorithms for discrete sequences. In this section we first describe a transformation 
of input sentences into a suitable representation for the use of such prediction 
algorithms. We then describe the prediction algorithm, which is obtained by 
extending a standard Lempel-Ziv compression algorithm. 



3.1 Representation via Quantized Time Differentials 

As described above, each input sample is a time sequence $(e_1, t_1), (e_2, t_2), \ldots,
(e_n, t_n)$ of keyboard events. The exact times at which events take place are
of little value. Of much greater interest is the time differential between two
events. Not surprisingly (and as noted by others, e.g. [2]), these differentials
contain much of the discriminative information between typists. Setting $\Delta_i =
t_{i+1} - t_i$, we transform the sentence into a sequence of its differentials so that
$(e_i, t_i) \mapsto (e_i, \Delta_i, e_{i+1})$. The resulting sequence of triplets consisting of events
and differentials faithfully represents the time transitions between events which
are relevant to typist discrimination. However, we choose to use the following
slightly different differential representation which can be uniquely determined
from a triplet sequence:

$$(e_1, \Delta_1, e_2), \ldots, (e_{n-1}, \Delta_{n-1}, e_n) \;\mapsto\; e_1, \Delta_1, e_2, \Delta_2, \ldots, e_{n-1}, \Delta_{n-1}, e_n .$$

This last representation (on the right-hand side) is simpler in the sense that
it has a smaller “alphabet” size. However, while the number of events is finite
the number of time differentials is not. First, unbounded differentials can be
avoided by specifying that all values larger than a specific $\Delta_{\max}$ represent the
start of a new sentence (keystrokes that are minutes apart are unlikely to be
related in any fashion). Second, for a given set of sentences emitted by
the typist $u$, which are to be learned, an additional transformation on the time
differentials is performed with the goal of limiting the number of time differentials
and smoothing over them. Fewer symbols make the data easier to learn, by
reducing statistical sparseness (and thus reducing variance). To accomplish this,
vector quantization [3] is used to cluster the time differentials into $Q$ clusters,
with $Q$ centroids $c_1, c_2, \ldots, c_Q$. The time differentials then undergo the following
transformation:

$$\Delta \mapsto c^* \quad \text{where} \quad c^* = \arg\min_{c_i} |\Delta - c_i| .$$

This transformation is used on all sentences that are to be learned or ranked
by the user’s model. Thus, the final makeup of a sentence is $\{e_1, q_1, e_2, q_2, \ldots,
e_n\}$, where $q_i \in \{c_1, \ldots, c_Q\}$ represents some time differential $\Delta$. Therefore,
the number of symbols in our alphabet is $|E| + Q$. Note that the number $Q$ of
centroids becomes a parameter of the algorithm.
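The representation of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, time stamps are taken to be integer milliseconds, and the quantizer is a plain one-dimensional Lloyd (k-means) iteration, which is one possible reading of "vector quantization [3]" and not necessarily the procedure used by the authors.

```python
def to_symbol_sentences(events, delta_max):
    """Turn a time-ordered list of (event, time) pairs into interleaved sentences
    e1, d1, e2, d2, ..., en. A differential larger than delta_max starts a new sentence."""
    sentences, current, prev_time = [], [], None
    for event, t in events:
        if prev_time is not None:
            delta = t - prev_time
            if delta > delta_max:
                sentences.append(current)   # close the current sentence
                current = []
            else:
                current.append(delta)
        current.append(event)
        prev_time = t
    if current:
        sentences.append(current)
    return sentences


def nearest_index(delta, centroids):
    """Index of the centroid closest to delta (ties go to the lower index)."""
    return min(range(len(centroids)), key=lambda i: abs(delta - centroids[i]))


def fit_centroids(deltas, q, iterations=20):
    """Assumed quantizer: 1-D Lloyd iterations with a deterministic initialisation."""
    data = sorted(deltas)
    centroids = [data[(2 * i + 1) * len(data) // (2 * q)] for i in range(q)]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for d in data:
            clusters[nearest_index(d, centroids)].append(d)
        # An empty cluster keeps its previous centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids


def quantize(delta, centroids):
    """Map a differential to c* = argmin_c |delta - c|."""
    return centroids[nearest_index(delta, centroids)]


def quantize_sentence(sentence, centroids):
    """Events sit at even positions, differentials at odd positions."""
    return [x if i % 2 == 0 else quantize(x, centroids)
            for i, x in enumerate(sentence)]
```

With centroids fitted on a user's training differentials, `quantize_sentence` yields a sequence over an alphabet of at most $|E| + Q$ symbols, which is the input expected by the prediction algorithm of the next section.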

3.2 Lempel-Ziv Universal Prediction 

Having represented a keyboard event sequence as a sequence of discrete symbols
over a finite alphabet, we can now use any universal prediction algorithm for discrete
sequences to generate conditional likelihood estimates of unseen sequences.
Specifically, given a set of training sentences for user $u$, we use a universal
prediction algorithm to train a model $M_u$ which is then capable of estimating
$\Pr(x \mid M_u)$, the conditional probability distribution of an unseen sentence $x$. Using
such conditional estimates we then solve the single-class classification problem.

There are a number of universal prediction algorithms whose empirical performance
for lossless text compression is considered state-of-the-art. Notable examples
are the context tree weighting method (CTW) [4], the Burrows-Wheeler
Transform (BWT) (see e.g. [5]) and variants of Prediction by Partial Matching
(PPM) [6]. For simplicity and for computational efficiency we compromise on
likelihood estimation accuracy and rely on the Lempel-Ziv algorithm (lz78 [7]). In
particular, we use the prediction component of the lz78 algorithm as described
in [8]. Besides being very simple and fast, this algorithm enjoys performance
guarantees of various types (see e.g. [9]). We also propose two improvements to
the algorithm, which appear to increase its prediction accuracy.

The lz78 Universal Prediction is a one-pass algorithm. It builds a weighted 
tree from sequences over a finite alphabet, and can assign probability estimates 
to new sentences given such a tree. The lz78 phrase tree holds a “dictionary” of 
phrases parsed from the training sequence and is constructed by parsing input 
sequences as follows. At each stage the algorithm parses the smallest prefix which 
is not yet in the tree. For example, the string “ababbac” is parsed into: a, b, ab, 
ba, c. This set of phrases can be viewed as a phrase tree such that each parsed 
phrase is a path from the root to a leaf (see [8] for a detailed exposition) . 
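The parsing rule above can be illustrated with a minimal sketch. The function name and the nested-dictionary tree are illustrative choices, not taken from [8]; the sketch only reproduces the "smallest prefix not yet in the tree" rule.

```python
def lz78_parse(sequence):
    """Parse `sequence` into LZ78 phrases and return (phrases, tree).

    The phrase tree is a nested dict: each key is a symbol, each value is the
    subtree below it. Every parsed phrase is a root-to-leaf path at the
    moment it is added.
    """
    root = {}
    phrases = []
    node, start = root, 0
    for i, symbol in enumerate(sequence):
        if symbol in node:
            node = node[symbol]            # still inside a known phrase
        else:
            node[symbol] = {}              # smallest prefix not yet in the tree
            phrases.append(sequence[start:i + 1])
            node, start = root, i + 1      # next phrase starts from the root
    return phrases, root


phrases, tree = lz78_parse("ababbac")
print(phrases)  # ['a', 'b', 'ab', 'ba', 'c']
```

A trailing fragment that is already in the tree when the input ends produces no new phrase in this sketch.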

As described in [8], the phrase tree can be extended to provide count statistics
by adding a counter to each node. These count statistics can be used to calculate
a probability estimate for traversing from a parent node to one of its children.
Given a set of sequences, $s_1, \ldots, s_k$, emitted by some source (say the user
$u$), a parse tree with appropriate counter statistics can be constructed for all $s_i$
(e.g. by concatenating the $s_i$ into one long sequence). The resulting statistical
model is denoted by $M_u$. $M_u$ can be used to compute the conditional probability
$\Pr(x \mid M_u) \approx \Pr(x \mid s_1, \ldots, s_k)$ of a new sequence $x$. This is done by traversing
down from the root according to the letters of $x$, and multiplying the probability
estimates of the traversals, until a leaf is reached. Then the traversal resumes
from the root. In practice, the normalized (negative) log-likelihood is used:

$$V(x, M_u) = -\log_2 \Pr(x \mid M_u) . \qquad (1)$$

This value is non-negative for all $x$ and is 0 (for finite length strings) only when
$\Pr(x \mid M_u) = 1$, which is the ideal prediction for any sentence emitted by $u$.
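As a rough sketch of how such a counted phrase tree can score a new sequence, the code below trains a tree with per-node counters and accumulates $-\log_2$ of the transition estimates. The estimator is an assumption made for illustration: transition probabilities are add-one smoothed ratios of child counters, and the exact counter scheme of [8] may differ. The names `Node`, `train` and `neg_log2_likelihood` are hypothetical.

```python
import math


class Node:
    def __init__(self):
        self.count = 0        # how often this node was entered during training
        self.children = {}    # symbol -> Node


def train(sequences):
    """Build one counted LZ78 phrase tree from all training sequences."""
    root = Node()
    for seq in sequences:
        node = root
        for symbol in seq:
            child = node.children.get(symbol)
            if child is None:
                child = node.children[symbol] = Node()
                child.count = 1
                node = root               # a new phrase ends here
            else:
                child.count += 1
                node = child
    return root


def neg_log2_likelihood(seq, root, alphabet_size):
    """Accumulate -log2 of the (assumed, add-one smoothed) transition estimates."""
    total = 0.0
    node = root
    for symbol in seq:
        seen = sum(c.count for c in node.children.values())
        child = node.children.get(symbol)
        p = ((child.count if child else 0) + 1) / (seen + alphabet_size)
        total -= math.log2(p)
        # Resume from the root after an unseen symbol or on reaching a leaf.
        node = child if child is not None and child.children else root
    return total
```

Dividing the returned value by `len(seq)` gives a per-symbol score; whether the normalization in Eq. (1) is of this form is not stated in the text above, so that step is left to the caller.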



3.3 Improvements to Standard LZ Prediction 

A major advantage of the lz78 parsing technique is its speed. This speed is
obtained by forgoing a systematic consideration of all substrings. While
for very large training sets this compromise will not affect the results significantly,
for small training sets (and short test sequences) it results in sparser
and noisier statistics. We propose two simple modifications to the algorithm
which increase the number of phrases extracted and improve performance of the
lz78 estimation. The two modifications are termed input shifting and back-shift
parsing.

Input shifting is used during the learning process to extract more phrases
from a sentence. Considering a sentence $x = x_1 x_2 \cdots x_n$, the sentence is parsed
once as described above. Then it is parsed $s$ more times, where in the $i$th additional
parsing we parse the suffix $x_{i+1} x_{i+2} \cdots x_n$ in the usual way (but starting
with the aggregated model constructed by previous parsings). The effect of input
shifting is to increase the number of phrases, thus making the phrase tree larger.
As $s$ grows so does the height of the tree as longer and longer phrases are parsed.
Note that by taking $s = 0$ we leave the lz78 algorithm intact.

Another deficiency of the lz78 algorithm is the loss of context when parsing
a sequence (and when calculating the likelihood of a sequence). Specifically,
each time the algorithm returns to the root (see description in Section 3.2) after
parsing a phrase in a sequence, the entire context consisting of previous
symbols is lost. In order to remedy this, we propose a method which utilizes
the last $m$ letters parsed to provide a prior context for the next phrase (taking
$m = 0$ leaves lz78 intact). This method, which we term Back-shift parsing
(BSP), seeks to achieve this by back-shifting $m$ letters after parsing each phrase.
This approach is problematic for $m > 1$, however, since more letters may be
back-shifted than parsed (which occurs often in practice). This seriously impedes
progress and compromises speed, which is one of the advantages of lz78.
We prevent this by requiring that the $m$ letters come from the last phrase
parsed. This slight change is implemented by utilizing a “marker” as
described in Figure 1. The resulting Back-Shift Parsing with a marker prevents
back-tracking beyond the marker, thus guaranteeing rapid progress. The
overall effect is to quickly build a tree with no path shorter than $m + 1$ in
length, or to make the tree deeper while minimally affecting its width.

BSP also affects the calculation of a probability estimate for an unseen sen- 
tence. Instead of returning to the root after traversing to a leaf, the last m letters 
traversed are first traced down from the root to some node v, and then the new 
traversal begins from v (if v does not exist, then the new traversal continues 
from the root instead). 

The modified algorithm now has two parameters and is denoted by lz78(s, m) 
where s determines the number of input shifts and m determines the context 
length for back-shifting. The following example shows the parsed phrases gener- 
ated by some lz78(s,m) algorithms for the sequence “ababbac”. Note that the 
phrases appear in the order of their parsing. 



Algorithm    Phrases parsed from “ababbac”
lz78(0,0)    {a, b, ab, ba, c}
lz78(1,1)    {a, ab, b, ba, abb, bac, c, bab, bb}
lz78(2,2)    {a, ab, aba, b, ba, bab, abb, bb, bba, bac, ac, babb, bbac, abba}



Initialization: marker = start of sentence
Repeat until no more phrases to be parsed:
    phrase = next phrase parsed (starting at marker)
    add phrase to dictionary
    if (length(phrase) > m)
        marker = marker + length(phrase) - m

Fig. 1. Pseudo-code for Back-Shift Parsing (BSP) with a marker.
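The following sketch combines input shifting with the marker rule of Fig. 1. It is one reading of the description, with the assumption that parsing of a pass stops once the remaining input is already in the dictionary; the function name is hypothetical and a Python set stands in for the phrase tree. Under this reading it reproduces the phrases listed in the example for “ababbac”.

```python
def lz78_sm_phrases(sequence, s=0, m=0):
    """Phrases extracted by lz78(s, m) from `sequence` (a str or tuple).

    s: number of additional parsings of the suffixes x_{i+1}...x_n, i = 1..s.
    m: context length for back-shift parsing with a marker.
    """
    dictionary = set()      # stands in for the phrase tree
    phrases = []
    n = len(sequence)
    for shift in range(s + 1):          # shift 0 is the ordinary parse
        marker = shift
        while marker < n:
            # Find the smallest prefix starting at the marker not yet known.
            end = marker + 1
            while end <= n and sequence[marker:end] in dictionary:
                end += 1
            if end > n:
                break                   # the rest of the input is already known
            phrase = sequence[marker:end]
            dictionary.add(phrase)
            phrases.append(phrase)
            if len(phrase) > m:
                marker += len(phrase) - m   # back-shift at most m letters
    return phrases


print(lz78_sm_phrases("ababbac", 0, 0))  # ['a', 'b', 'ab', 'ba', 'c']
print(lz78_sm_phrases("ababbac", 1, 1))
# ['a', 'ab', 'b', 'ba', 'abb', 'bac', 'c', 'bab', 'bb']
```

When a phrase is no longer than $m$ the marker does not move, but the phrase has just been added, so the next phrase from the same position is strictly longer and the loop still terminates.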



3.4 Single-Class Classification and Model Selection 

Let $D_u = \{S_1, \ldots, S_n\}$ be a training set of sentences emitted by $u$. Given a fixed
choice of the parameters $s$ and $m$ we use the lz78(s, m) algorithm to build a
model $M_u = M(D_u, s, m)$ for the user $u$ (thus, for a particular user, a model
corresponds to a choice of $s$ and $m$). This model can provide likelihood estimates
for unseen sentences. Given an unseen sequence $x$ we should determine
whether $\Pr(x \mid M_u)$ is sufficiently large to “accept” $x$ (alternatively, that $V(x, M_u)$
is sufficiently small; see Eq. (1)). To this end, a cutoff point, or threshold $t$, is
necessary. We determine a threshold using the following leave-one-out methodology.
For each training sentence $S_i$ in $D_u$ we calculate the likelihood of $S_i$ given
a model trained on $D_u$ excluding $S_i$. More formally, for each $S_i \in D_u$ let

$$V_i = V(S_i, M(D_u \setminus \{S_i\}, s, m)) . \qquad (2)$$

Let $\mu_M$ and $\sigma_M$ be the empirical average and standard deviation of the $V_i$,
$i = 1, \ldots, n$. An ideal (but perhaps not achievable) threshold $t$ places all (future)
user’s sentences below the cutoff and attackers’ sentences above. Given
the evidence we have (the training sentences for $u$) we attempt to guarantee
results for the user by setting the threshold to $t(M) = \mu_M + k_\sigma \sigma_M$ where $k_\sigma$ is
sufficiently large. Using Chebyshev’s inequality, for any $k_\sigma$ we can provide for
$u$ a confidence level as follows. For any random variable $X$ whose mean and
standard deviation are $\mu$ and $\sigma$, respectively, a one-tailed version of Chebyshev’s
inequality [10] states that for any $k > 0$, $\Pr\{X - \mu \ge k\sigma\} \le 1/(1 + k^2)$. Considering
future sentences emitted by $u$ as observations of a random variable $S$ and
taking $\mu_M$ and $\sigma_M$ as estimates of the true mean and standard deviation of
the random variable $V(S, M(s, m, D_u))$, we have for any choice $k_\sigma$ (and using
$t(M) = \mu_M + k_\sigma \sigma_M$), $\Pr\{V(S, M(s, m, D_u)) \ge t(M)\} \le 1/(1 + k_\sigma^2)$. Thus the
confidence level is $1 - \delta = 1 - 1/(1 + k_\sigma^2) = k_\sigma^2/(1 + k_\sigma^2)$.
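A small sketch of the leave-one-out threshold and its Chebyshev confidence level follows. The callables `train` and `score` are hypothetical placeholders for model construction and for $V(\cdot,\cdot)$, and the use of the population standard deviation is an assumption, since the text does not say which empirical estimate is meant.

```python
import statistics


def leave_one_out_scores(sentences, train, score):
    """V_i = score(S_i, model trained on all sentences except S_i), as in Eq. (2)."""
    return [score(s, train(sentences[:i] + sentences[i + 1:]))
            for i, s in enumerate(sentences)]


def threshold_and_confidence(loo_scores, k_sigma):
    """Return t(M) = mu_M + k_sigma * sigma_M and the bound k^2 / (1 + k^2)."""
    mu = statistics.mean(loo_scores)
    sigma = statistics.pstdev(loo_scores)   # assumption: population std. dev.
    threshold = mu + k_sigma * sigma
    confidence = k_sigma ** 2 / (1 + k_sigma ** 2)
    return threshold, confidence


# Toy scores with mean 5 and standard deviation 2.
t, conf = threshold_and_confidence([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], 2)
print(t, conf)  # 9.0 0.8
```

The bound is distribution-free, so it is loose: $k_\sigma = 2$ guarantees only 80% confidence and $k_\sigma = 3$ guarantees 90%.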



To summarize, our typist identification algorithm has four parameters: $Q$,
the quantization level; $m$ and $s$, the parameters of the improved lz78 algorithm;
and $k_\sigma$, which determines the acceptance threshold. Our goal is to set values for these
parameters based only on the training set $D_u$.

Within a minimax setting, we choose the best model which maximizes the
likelihood of the “hardest” training sentence. Specifically, we take

$$M^* = \arg\min_{M} \max_{S \in D_u} V_M(S, D_u) . \qquad (3)$$

This optimization determines values for the parameters $Q$, $m$ and $s$. The parameter
$k_\sigma$ is set such that the maximum $V_M$ value in (3) is just below the threshold
$t(M^*)$ and will be accepted by the model. Specifically, $\max_{S \in D_u} V_{M^*}(S, D_u) =
\mu_{M^*} + k_\sigma \sigma_{M^*}$ and solving for $k_\sigma$ we get

$$k_\sigma = \frac{\max_{S \in D_u} V_{M^*}(S, D_u) - \mu_{M^*}}{\sigma_{M^*}} .$$

In addition to the above single-class setting we also consider a setting where
a (small) set of “attacker” sentences is available for training. Clearly, if such
a set of “outliers” is not very large, it is not likely to faithfully represent the
general statistics of outliers. However, it is interesting to investigate whether
this additional piece of information can be exploited to improve performance.
Although this problem is typically not a standard two-class problem, for the
rest of the paper we call this setting the ‘two-class’ setting. Denote by $D_a$ the
set of attacker sentences available for training. For each model $M$, let $t_M(k_\sigma) =
\mu_M + k_\sigma \sigma_M$ where $\mu_M$ and $\sigma_M$ are estimated as described above. The accuracy
of the model $M$ with respect to the decision threshold is given by the ratio of
the number of correctly classified strings to the total number of strings:

$$A(M, k_\sigma) = \frac{|\{x \in D_u : V(x, M) < t_M(k_\sigma)\}| + |\{x \in D_a : V(x, M) > t_M(k_\sigma)\}|}{|D_u| + |D_a|} .$$

Let $\epsilon$ be any limit on the desired accuracy ($0 < \epsilon < 1$). The robustness $R_\epsilon(M)$
of the model $M$ is defined as

$$R_\epsilon(M) = \int_{k > 0 \,:\, A(M, k) \ge 1 - \epsilon} A(M, k)\, dk .$$

That is, the robustness is the area below the accuracy curve viewed as a function
of threshold magnitude. Note that in practice the robustness can be rather
accurately estimated using the average accuracy of the model over a number
of suitable $k_\sigma$ representatives. Figure 2 depicts the accuracy curves of various
lz78(s, m) models. The areas enclosed by these curves and the 90% asymptote
are the $\epsilon$-robustness values of these models (with $\epsilon = 0.1$).
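The accuracy and robustness of a model in the two-class setting can be estimated numerically along the following lines. This is a sketch under assumptions: scores are the $V$ values of user and attacker sentences under a fixed model, the threshold grid `ks` is evenly spaced, and the integral is approximated by a simple Riemann sum. The function names are hypothetical.

```python
def accuracy(user_scores, attacker_scores, threshold):
    """Fraction of sentences classified correctly at the given threshold:
    user sentences must score below it, attacker sentences above it."""
    correct = (sum(v < threshold for v in user_scores)
               + sum(v > threshold for v in attacker_scores))
    return correct / (len(user_scores) + len(attacker_scores))


def robustness(user_scores, attacker_scores, mu, sigma, ks, eps):
    """Riemann-sum estimate of the area under the accuracy curve, restricted
    to thresholds mu + k*sigma whose accuracy is at least 1 - eps.
    Assumes `ks` is an evenly spaced grid with at least two points."""
    step = ks[1] - ks[0]
    area = 0.0
    for k in ks:
        a = accuracy(user_scores, attacker_scores, mu + k * sigma)
        if a >= 1 - eps:
            area += a * step
    return area


users, attackers = [1, 2, 3], [6, 7, 8]
print(robustness(users, attackers, mu=2, sigma=1, ks=[0, 1, 2, 3, 4, 5, 6], eps=0.1))
# 2.0
```

In the toy example only $k = 2$ and $k = 3$ reach 90% accuracy, each with accuracy 1, so the estimated area is 2.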





Fig. 2. Accuracy as a function of threshold magnitude for various lz78(s, m) models.
Areas above the 90% asymptote are robustness values. For example, $R_{0.1}(\text{lz78}(0, 1))$
and $R_{0.1}(\text{lz78}(1, 0))$ are the largest robustness values in the left and right panels,
respectively.



The model $M^*$ which maximizes robustness is selected in this two-class
setting, and $k_\sigma$ is set to maximize accuracy as measured over the training set
$D_u \cup D_a$. That is, $k_\sigma = \arg\max_k A(M^*, k)$. Note that there may be more than
a single value which gives the maximum accuracy. In this case, there may be
several peaks in the accuracy curve. We note however that in practice a single
broad peak is typically observed. Whenever there is more than one maximum,
we heuristically choose the threshold as the midpoint of the widest peak.
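The midpoint heuristic can be sketched as follows, assuming accuracy has been evaluated on an evenly spaced grid of $k_\sigma$ values and that a "peak" is a maximal run of consecutive grid points attaining the maximum accuracy. The function name is hypothetical.

```python
def midpoint_of_widest_peak(ks, accuracies):
    """Return the k at the centre of the widest run of maximal accuracy.

    ks and accuracies are equally long lists; ties between equally wide
    runs are resolved in favour of the first one.
    """
    best = max(accuracies)
    widest = (0, -1)          # (start, end) indices of the widest run so far
    start = None
    for i, a in enumerate(accuracies + [None]):   # sentinel closes the last run
        if a == best:
            if start is None:
                start = i
        elif start is not None:
            if (i - 1 - start) > (widest[1] - widest[0]):
                widest = (start, i - 1)
            start = None
    return (ks[widest[0]] + ks[widest[1]]) / 2


print(midpoint_of_widest_peak([0, 1, 2, 3, 4, 5, 6],
                              [0.5, 1.0, 0.5, 1.0, 1.0, 1.0, 0.5]))  # 4.0
```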

4 Dataset and Experimental Setup

For evaluating the proposed algorithms a dataset of keyboard event sentences
was collected from 5 users and 30 attackers. We note that the recording of
keyboard events including their precise time stamps is not straightforward using
user-level programs on most standard operating systems. Thus, a suitably
adapted system was constructed including a modified keyboard interrupt service
routine². Each of the users and attackers typed several sentences. The user input
sequences were on average longer than the attackers’ input. The text typed by
users corresponded to answers to open ended questions (e.g. “What did you do
today?”) and to a specific sentence (“To be or not to be. That is the question.”).
Additionally a (completely) free text section was also allowed. On average, each
user recorded 2551 ± 1866 keystrokes. Each of the thirty “attackers” was asked
two open ended questions, and was required to type the specific sentence “To
be or not to be. That is the question.” They were also allowed to type in free
text. On average 660 ± 597 keystrokes were logged for an attacker³. To maximize
the utility of this dataset, the sentences, both before learning and before testing,
were split into segments of 100 keystrokes (arbitrarily set). Additionally, all of
the attackers’ sentences (120 in total) were used to attack each model selected.

² In particular, a Linux system was used with all non-essential modules and services
removed or disabled. System calls were used to request unbuffered keyboard events.

We selected a set of “feasible” parameter values for the models⁴. To maximize
evaluation accuracy we used the following leave-one-out protocol: For each user
$u$, each of the sentences in $D_u$ was in turn selected to be in the test set and the
rest of the sentences remained in the training set. Once a model was selected, we
tested whether the model could identify and accept the left out sentence.

For the two-class problem, where we wish to see if providing attacker data can 
improve performance, the attackers were partitioned into two groups: a group of 
10 attackers to be used for training (40 sentences), and a group of 20 attackers 
for testing (80 sentences). Other than the partitioning of the attackers, testing 
was identical to that of the single-class case, although the 10 attackers used for 
training were not used for testing. One hundred cross-validation folds were made. 

5 Experimental Results 

We begin by considering the results obtained for the single-class setting, which
are given in Table 1. As can be seen, impressive performance
can be achieved by the system. The system performs well even when
limited information is available (for example, user 5), though performance,
particularly in self identification, does slightly suffer. Table 2 shows the results for
the two-class experiments. Performance, on average, was similar to the single-class
results, though user 5 did have a marked decline in self identification success.
This does not seem to be dependent on the amount of data available, as user
1’s performance also dropped, though less significantly. Performance in terms of
successfully defending did improve, however, achieving perfect scores for nearly
all of the users, which resulted in a higher break-even point.

In addition, we examined the performance of the algorithm when models were
restricted to use the “pure” lz78 algorithm (i.e. the lz78(0, 0) model with $Q$ and
$k_\sigma$ still variable was trained with the same methodology), both for the single-class
and two-class problems. Due to space limitations we only report on the
estimated break-even points for these experiments which were 93.57 and 96.42
for the single-class and two-class problems, respectively. These results indicate
that the lz78(s, m) modifications have a significant advantage in the single-class
setting, particularly when there is little data available for training. For example,
for users 4 and 5, the estimated break-even points for the “pure” lz78 algorithm
are 90.2 and 80.9, respectively. With our improvements the values obtained for
these users are 95.1 and 94.9, respectively.

³ The complete dataset will be available at
http://www.cs.technion.ac.il/~rani/typist.

⁴ The particular values we tested are Q = 80, 90, 100, 110, 120; m = 0, 1, 2, 3, 4; s
= 0, 1, 2, 3, 4; and k_σ = 0, 0.25, 0.5, 0.75, ..., 10.

Table 1. Single-Class Results: Individual users, averages and estimated break-even
point (defined to be the harmonic mean of the averages).

User      # Keystrokes  # Sentences  # Self “Attacks”  Self ID (%)      # True Attacks  Defense (%)
1         5344          13           114               97.37 ± 2.79     1560            98.33 ± 1.08
2         4156          16           90                97.78 ± 8.53     1920            100.0 ± 0.63
3         1630          5            36                94.44 ± 11.65    600             99.67 ± 0.41
4         1076          5            23                91.3 ± 10.85     600             99.33 ± 0.97
5         548           5            14                92.86 ± 9.58     600             97.0 ± 3.82
Averages                                               94.75 ± 2.51                     98.87 ± 1.09
Estimated Break-Even Point: 96.77

Table 2. Two-Class Results: Individual user, averages and break-even point. Results
are across all 100 cross-validation folds.

User      # Keystrokes  # Sentences  # Self “Attacks”  Self ID (%)      # True Attacks  Defense (%)
1         5344          13           11400             93.86 ± 7.01     104000          100.0 ± 0.0
2         4156          16           9000              100.00 ± 0.0     128000          100.0 ± 0.0
3         1630          5            3600              97.22 ± 11.45    40000           100.0 ± 0.0
4         1076          5            2300              95.65 ± 11.23    40000           99.75 ± 0.5
5         548           5            1400              85.71 ± 17.2     40000           100.0 ± 0.0
Averages                                               94.49 ± 4.83                     99.95 ± 0.1
Estimated Break-Even Point: 97.14
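The estimated break-even points reported in Tables 1 and 2 are defined as the harmonic mean of the self-identification and defense percentages, which is easy to check. The helper name below is hypothetical; `statistics.harmonic_mean` is part of the Python standard library.

```python
from statistics import harmonic_mean


def break_even(self_id_pct, defense_pct):
    """Estimated break-even point: harmonic mean of the two percentages."""
    return harmonic_mean([self_id_pct, defense_pct])


print(round(break_even(94.75, 98.87), 2))  # 96.77  (Table 1 averages)
print(round(break_even(94.49, 99.95), 2))  # 97.14  (Table 2 averages)
print(round(break_even(91.3, 99.33), 1))   # 95.1   (user 4, single-class)
print(round(break_even(92.86, 97.0), 1))   # 94.9   (user 5, single-class)
```

The harmonic mean is pulled toward the smaller of the two rates, so a model cannot obtain a high break-even point by excelling at defense while rejecting the legitimate user.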

6 Related Work 

There is quite an extensive literature on “keystroke dynamics”, which attempts to
identify characterizing features in keystroke sequences. One of the earliest works
is [11], which introduces the use of “digraph times” in this context. For each pair
of keys typed, its digraph time is the interval between the pressing of the first key
and the pressing of the second. Many other works later use this basic idea or its
extensions to “trigraphs”, etc. Due to space limitations we limit the discussion
here to two of the most recent papers, which present the most impressive results
to date. The work presented in [2] uses a combination of digraph times and
keystroke latencies to generate feature vectors. Factor analysis is then used to
select discriminative features. Using a nearest neighbor approach together with
clustering, the authors examine the classification success rate of a number of
distance functions. On a dataset consisting of 63 users, the best results are obtained
using a Bayesian distance function. The stated results are approximately
92%. These results were obtained over a dataset where all users typed fixed text
selections from “a list of phrases”. There was also a free text component in this
study, though results are not presented and are stated to be inferior. The recent
results of [12] consider again identifying typists of a fixed phrase. This phrase
consists of 683 characters (which form 125 words). Using a fixed trigraph vector
representation the authors obtain very high accuracy using a heuristic distance
measure between trigraph vectors. Their best results for the single-class problem
are a 1.8% false alarm rate and a 0.01% “imposter pass” rate. The authors
also test higher order “graphs” and experiment with subsets of the fixed phrase.
While higher order graphs (e.g. 6-graphs) do not improve results, the use of
sub-phrases can drastically increase the false alarm rate (e.g. by taking 1/4 of
the phrase the false alarm rate increases to more than 12%). While these two
works indicate that very high precision can be obtained in recognizing keystroke
“signatures” over a fixed text, these methods fall short in handling free text,
particularly when little data is available. The main contribution of the present
work is in showing for the first time a new representation and algorithms that
can attain very high accuracy also for free text. The results we obtain (e.g. over
96% break-even for the single-class authentication problem) enable a practical
behaviometric security system for continual non-intrusive authentication, which
can handle any text. These results are not directly comparable to the above results.
However, when considering sample sizes and accuracy, it appears that our
results may be significantly better than the results of [12]. Nevertheless, these
other results are obtained with an impressive database consisting of typed sentences
from 44 users and 110 attackers whereas our primarily free text dataset
consists of 5 users and 30 attackers.

As noted previously, the more challenging and perhaps more common setting for a
security system as described here is that of a single-class problem. This variant of
binary classification goes by various other names, such as novelty detection,
outlier detection, and one-class classification. For other approaches for setting the
boundary in single-class problems see e.g. [13-15].



7 Conclusions and Future Work 

We have introduced an approach to modeling keystroke dynamics of users based 
on using the universal Lempel-Ziv compression algorithm as a generator of the 
predictive distribution of future strings, based on statistics collected from an 
individual user. We use this predictive distribution in the context of single- 
class learning, where particular values for the augmented Lempel-Ziv algorithm 
are selected based on cross-validation. While previous work tended to focus on 
fixed representations based on N-graphs, which can be considered to be fixed
order Markov models, our representation allows for variable length contextual 
information. As a result of this, our statistical model is capable of retaining more 
robust statistics, possibly at the cost of increased space requirements. 

Our method can be potentially improved in several ways. First, other univer- 
sal prediction algorithms (such as CTW; see Section 3.2) could perhaps improve 
prediction accuracy, at the expense of speed. Such a compromise may be unac- 
ceptable for the particular application of continual non-intrusive authentication. 

It may be interesting to investigate whether taking relative time-differentials
(rather than absolute time-differentials) can improve performance, perhaps by a
reduction of the variance caused by the variability in typing speeds of users. This
direction is particularly promising when considering the successful technique of
[12], which achieved impressive performance on a fixed text by ignoring absolute
differential times (but utilizing the relative sizes of trigraph times).

While our results are impressive, they can only be viewed as a proof-of-concept
due to the limited sample size, and the use of a single session for data
acquisition. Finally, an advantage of our techniques is that they are not specifically
targeted to the keyboard, and can be easily extended to other devices.

Abstract. Data mining for spatial data has become increasingly important as
more and more organizations are exposed to spatial data from sources such as
remote sensing, geographical information systems, astronomy, computer cartography,
environmental assessment and planning, etc. Recently, density based
clustering methods, such as DENCLUE, DBSCAN, OPTICS, have been published
and recognized as powerful clustering methods for data mining. These
approaches have run time complexity of $O(n \log n)$ when using spatial index
techniques such as the R* tree and grid cells. However, these methods are known to lack
scalability with respect to dimensionality. In this paper, a unique approach to
efficient neighborhood search and a new efficient density based clustering algorithm
using EIN-rings are developed. Our approach exploits compressed vertical
data structures, Peano Trees (P-trees¹), and fast P-tree logical operations to
accelerate the calculation of the density function within EIN-rings. This approach
stands in contrast to the ubiquitous approach of vertically scanning horizontal
data structures (records). The average run time complexity of our algorithm
for spatial data in $d$ dimensions is $O(dn\sqrt{n})$. Our proposed method has
cardinality scalability comparable with other density methods for small and
medium-sized data sets, but superior speed and dimensional scalability.



1 Introduction 

With the rapid growth of large quantities of spatial data collected in various application
areas, such as remote sensing, geographical information systems, astronomy,
computer cartography, environmental assessment and planning, efficient spatial data
mining methods are in great demand. Density based cluster algorithms have been
widely used in the mining of large spatial data. Density based cluster algorithms
group the attribute objects into a set of connected dense components separated by
regions of low density. A cluster is regarded as a connected dense region of objects,
which grows in any direction that density leads. Density based cluster algorithms have
been recognized as a powerful clustering approach capable of discovering clusters of
arbitrary shape as well as dealing with noise and outliers for spatial data mining.

There are two major approaches for density-based methods. The first approach is
represented by DENCLUE [3]. It exploits a density function, e.g., step function or
Gaussian function, to measure the density in attribute metric space. Clusters are identified
by determining corresponding density attractors. Thus, clusters of arbitrary shape
can be easily determined by overall density functions. This algorithm scales well with
run time complexity $O(n \log n)$ by means of grid cell techniques. However, it requires
careful selection of the density parameter $\sigma$ and noise threshold $\xi$, which may
significantly influence the quality of the clustering results [10].

The second approach calculates the density of all data points and groups them
based on density connectivity. Typical algorithms in this approach include DBSCAN
[6] and OPTICS [8]. DBSCAN first defines a core object as a set of neighbor points
consisting of more than a specified number of data points. All the data points reachable
within a chain of overlapping core objects define a cluster. The run time complexity
of DBSCAN is $O(n \log n)$ for spatial data when using a spatial index. Otherwise,
it is $O(n^2)$ [10]. OPTICS can be considered as an extension of DBSCAN
without providing global density. It assumes each cluster has its own density parameter
and uses a random variable to learn its probability distribution. It has the same run
time complexity as DBSCAN, that is, $O(n \log n)$ if a spatial index is used and $O(n^2)$
otherwise.

However, the spatial index techniques, such as R tree, R* tree, and grid cell, are
known to be suitable for low dimensional data sets. They perform well in 2-3 dimensions.
In high dimensional spaces they exhibit poor behavior in the worst case and in
typical cases as well [0]. The reason is that the data space becomes sparse at high
dimensionalities, causing the bounding regions to become large. In this paper, a
unique approach to efficient neighborhood search using EIN-rings, and a new efficient
density based clustering algorithm are developed. The central idea is to make use
of P-trees and EIN-rings to calculate the density function in $O(\sqrt{n})$ time, on the average.
Our approach exploits compressed vertical data structures, Peano Trees (P-trees),
and fast P-tree logical operations to accelerate the calculation of the density function
within EIN-rings. This approach stands in contrast to the ubiquitous approach of vertically
scanning horizontal data structures (records). Furthermore, we adopt a look
around pruning method to combine the density calculation and a hill climbing technique.
The overall run time complexity is $O(dn\sqrt{n})$ for a $d$-dimensional data set, on
the average. Experimental results show that the algorithm works efficiently on large-scale,
high-dimensional spatial data, outperforming other density methods significantly.

This paper is organized as follows. In section 2, we first briefly review the basic
P-trees, and then present a variation of the P-tree, the range predicate tree. In section 3, we
define unique equal interval neighborhood rings, EIN-rings, and then present the
new efficient density clustering method using EIN-rings. Finally, we compare our
method with other density methods experimentally in section 4 and conclude the
paper in section 5.

2 Extended Peano Trees

A new tree structure, the Peano tree (P-tree), was developed to facilitate efficient data
mining [1][2]. In this section, we first briefly review the basic P-trees, and then develop
a new calculation method for a variation of the P-tree, range predicate trees. In this
paper, we use ∧, ∨ and prime (′) to denote P-tree operations AND, OR and NOT,
respectively.



2.1 Review of Basic Peano Trees 

A basic P-tree is a lossless, bitwise, vertical, quadrant-based compressed tree, which
can be 1-dimensional, 2-dimensional, 3-dimensional, etc. For a data set with d feature
attributes, $X = (A_1, A_2, \ldots, A_d)$, and the binary representation of the j-th feature
attribute $A_j$ as $b_{j,m} b_{j,m-1} \ldots b_{j,i} \ldots b_{j,1} b_{j,0}$, we strip each
feature attribute into several files, one file for each bit position. Such files are called
bit files. A bit file is then recursively partitioned into quadrants and each quadrant into
sub-quadrants until the sub-quadrant is pure (entirely 1-bits or entirely 0-bits). The
recursive raster ordering is called the Peano or Z-ordering in the literature; hence the
name Peano tree.

We illustrate the detailed construction of P-trees using the example shown in Fig. 1.
The spatial data is the red reflectance value of a 2-dimensional spatial data set, which is
shown in a). We represent the reflectance as binary values, e.g., $(7)_{10} = (111)_2$. We then
strip them into three separate bit files, one file for each bit, as shown in b), c), and d).
The corresponding basic P-trees, $P_1$, $P_2$ and $P_3$, are constructed by recursive
partitioning and are shown in e), f) and g).

As shown in e) of Fig. 1, the root of the $P_1$ tree is 36, which is the 1-bit count of the
entire bit file. The second level of $P_1$ contains the 1-bit counts of the four quadrants,
16, 7, 13, and 0. Since quadrant 0 and quadrant 3 are pure, there is no need to partition
these quadrants. Quadrants 1 and 2 are further partitioned recursively. We note here
that we identify quadrants using a Quadrant identifier, Qid: the string of successive
sub-quadrant numbers (0, 1, 2 or 3 in Z or Peano order), separated by "." (as in IP
addresses). Thus, the Qid of the bolded and underlined quadrant in Fig. 1 is 2.2.

AND, OR and NOT logic operations are the most frequently used P-tree opera- 
tions. The P-tree logical operations are performed level-by-level starting from the root 
level. They are commutative and distributive, since they are simply pruned bit-by-bit 
operations. For instance, ANDing a pure-0 node with anything results in a pure-0 
node, ORing a pure-1 node with anything results in a pure-1 node. 
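
The construction and the pruned AND described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the tuple layout `(count, size, children)` and the function names `build_ptree` and `ptree_and` are assumptions made here, and the input is assumed to be a square 0/1 grid whose side is a power of two.

```python
def build_ptree(bits, r0=0, c0=0, size=None):
    """Build a count tree for the square block of `bits` starting at (r0, c0).

    A node is (count, size, children); children is None for a pure quadrant.
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if count == 0 or count == size * size:
        return (count, size, None)      # pure-0 or pure-1: no further partition
    half = size // 2
    # Z (Peano) order: upper-left, upper-right, lower-left, lower-right
    children = [build_ptree(bits, r0 + dr, c0 + dc, half)
                for dr, dc in ((0, 0), (0, half), (half, 0), (half, half))]
    return (count, size, children)


def ptree_and(a, b):
    """AND two P-trees built over the same grid, pruning at pure quadrants."""
    (count_a, size, kids_a), (count_b, _, kids_b) = a, b
    if count_a == 0 or count_b == 0:
        return (0, size, None)          # pure-0 AND anything is pure-0
    if kids_a is None:                  # a is pure-1, so the result equals b
        return b
    if kids_b is None:                  # b is pure-1, so the result equals a
        return a
    children = [ptree_and(x, y) for x, y in zip(kids_a, kids_b)]
    count = sum(child[0] for child in children)
    if count == 0:
        return (0, size, None)
    return (count, size, children)


bits_a = [[1, 1, 0, 0],
          [1, 1, 0, 1],
          [0, 0, 1, 1],
          [0, 0, 1, 1]]
bits_b = [[1, 1, 1, 1],
          [1, 1, 1, 1],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]
tree_a = build_ptree(bits_a)
tree_b = build_ptree(bits_b)
root_count = ptree_and(tree_a, tree_b)[0]   # 1-bit count of the ANDed bit file
```

The root count of the result is obtained without visiting any bit below a pure quadrant, which is where the speed-up of P-tree operations comes from.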



2.2 Range Predicate Trees 

A range predicate tree, $P_{x \diamond y}$, is a basic P-tree that satisfies the predicate
$x \diamond y$, where y is a boundary value and $\diamond$ is a comparison operator, i.e.,
$<$, $>$, $\ge$, or $\le$. Without loss of generality, we only present the calculation of the
range predicate trees $P_{A>c}$ and $P_{A \le c}$ and their proofs as follows.

Lemma 1. Complement Rule of P-tree. Let $P_1$, $P_2$ be basic P-trees, and let $P_1'$ be the
complement P-tree of $P_1$; then $P_1 \vee (P_1' \wedge P_2) = P_1 \vee P_2$.

Fig. 1. Construction of 2-D Basic P-trees for Spatial Data.



Proof:

$P_1 \vee (P_1' \wedge P_2)$

$= (P_1 \vee P_1') \wedge (P_1 \vee P_2)$   (according to the distribution property of P-tree operations)

$= \mathrm{True} \wedge (P_1 \vee P_2)$

$= P_1 \vee P_2$



Proposition 1. Let A be the j-th attribute of data set X, m be its bit-width, and
$P_m, P_{m-1}, \ldots, P_0$ be the basic P-trees for the vertical bit files of A. Let
$c = b_m \ldots b_i \ldots b_0$, where $b_i$ is the i-th binary bit value of c, and let
$P_{A>c}$ be the predicate tree for the predicate $A > c$; then

$P_{A>c} = P_m \; op_m \ldots P_i \; op_i \; P_{i-1} \ldots op_{k+1} \; P_k, \quad k \le i \le m,$   (1)

where 1) $op_i$ is $\wedge$ if $b_i = 1$, and $op_i$ is $\vee$ otherwise, 2) k is the rightmost
bit position with value "0", i.e., $b_k = 0$, $b_j = 1$, $\forall j < k$, and 3) the operators
are right binding. Here right binding means operators are associated from right to left,
e.g., $P_2 \; op_2 \; P_1 \; op_1 \; P_0$ is equivalent to $(P_2 \; op_2 \; (P_1 \; op_1 \; P_0))$.

Proof (by induction on number of bits):

Base case: without loss of generality, assume $b_0 = 1$; then we need to show that
$P_{A>c} = P_1 \; op_1 \; P_0$ holds. If $b_1 = 1$, obviously the predicate tree for
$A > (11)_2$ is $P_{A>c} = P_1 \wedge P_0$. If $b_1 = 0$, the predicate tree for $A > (01)_2$ is
$P_{A>c} = P_1 \vee (P_1' \wedge P_0)$. According to Lemma 1, we get that
$P_{A>c} = P_1 \vee P_0$ holds.

Inductive step: assume $P_{A>c} = P_n \; op_n \ldots P_k$; we need to show that
$P_{A>c} = P_{n+1} \; op_{n+1} \; P_n \; op_n \ldots P_k$ holds. Let
$P_{right} = P_n \; op_n \ldots P_k$. If $b_{n+1} = 1$, then obviously
$P_{A>c} = P_{n+1} \wedge P_{right}$. If $b_{n+1} = 0$, then
$P_{A>c} = P_{n+1} \vee (P_{n+1}' \wedge P_{right})$. According to Lemma 1, we get that
$P_{A>c} = P_{n+1} \vee P_{right}$ holds.

Proposition 2. Let A be the j-th attribute of data set X, m be its bit-width, and
$P_m, P_{m-1}, \ldots, P_0$ be the basic P-trees for the vertical bit files of A. Let
$c = b_m \ldots b_i \ldots b_0$, where $b_i$ is the i-th binary bit value of c, and let
$P_{A \le c}$ be the predicate tree for $A \le c$; then

$P_{A \le c} = P_m' \; op_m \ldots P_i' \; op_i \; P_{i-1}' \ldots op_{k+1} \; P_k', \quad k \le i \le m,$   (2)

where 1) $op_i$ is $\wedge$ if $b_i = 0$, and $op_i$ is $\vee$ otherwise, 2) k is the rightmost
bit position with value "0", i.e., $b_k = 0$, $b_j = 1$, $\forall j < k$, and 3) the operators
are right binding.

Proof (by induction on number of bits):

Base case: without loss of generality, assume $b_0 = 0$; then we need to show that
$P_{A \le c} = P_1' \; op_1 \; P_0'$ holds. If $b_1 = 0$, obviously the predicate tree for
$A \le (00)_2$ is $P_{A \le c} = P_1' \wedge P_0'$. If $b_1 = 1$, the predicate tree for
$A \le (10)_2$ is $P_{A \le c} = P_1' \vee (P_1 \wedge P_0')$. According to Lemma 1, we get that
$P_{A \le c} = P_1' \vee P_0'$ holds.

Inductive step: assume $P_{A \le c} = P_n' \; op_n \ldots P_k'$; we need to show that
$P_{A \le c} = P_{n+1}' \; op_{n+1} \; P_n' \; op_n \ldots P_k'$ holds. Let
$P_{right} = P_n' \; op_n \ldots P_k'$. If $b_{n+1} = 0$, then obviously
$P_{A \le c} = P_{n+1}' \wedge P_{right}$. If $b_{n+1} = 1$, then
$P_{A \le c} = P_{n+1}' \vee (P_{n+1} \wedge P_{right})$. According to Lemma 1, we get that
$P_{A \le c} = P_{n+1}' \vee P_{right}$ holds.

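Theorem 1. Complement Rule. Let A be an attribute of data set X, and let $P_{A \le c}$ and
$P_{A>c}$ be the predicate trees for $A \le c$ and $A > c$, where c is a boundary value; then
$P_{A \le c} = P_{A>c}'$.

Proof: Every data point satisfies exactly one of $A \le c$ and $A > c$, so the two predicate
trees are complements of each other. This theorem can be exploited to reduce the
computation time of predicate trees.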
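
Formula (1) and Theorem 1 can be tried out on uncompressed bit slices. The sketch below is an illustration under stated assumptions rather than the paper's implementation: each vertical bit file is held as a Python integer used as a bitmap (bit r of the integer belongs to row r) instead of a compressed P-tree, the boundary value c is assumed to fit in the given bit width, and the helper names are hypothetical.

```python
def bit_slices(values, width):
    """slices[i] is a bitmap whose bit r equals bit i of values[r]."""
    slices = [0] * width
    for row, value in enumerate(values):
        for i in range(width):
            if (value >> i) & 1:
                slices[i] |= 1 << row
    return slices


def greater_than(slices, c):
    """Bitmap of rows with A > c, following formula (1) with right binding."""
    width = len(slices)
    zero_positions = [i for i in range(width) if not (c >> i) & 1]
    if not zero_positions:
        return 0                        # c is all ones: no value can exceed it
    k = zero_positions[0]               # rightmost bit position holding 0
    result = slices[k]
    for i in range(k + 1, width):       # fold from right to left
        if (c >> i) & 1:
            result = slices[i] & result     # op_i is AND when b_i = 1
        else:
            result = slices[i] | result     # op_i is OR when b_i = 0
    return result


def less_equal(slices, c, n_rows):
    """Bitmap of rows with A <= c, obtained as the complement (Theorem 1)."""
    all_rows = (1 << n_rows) - 1
    return all_rows & ~greater_than(slices, c)


def rows_of(bitmap):
    """Row numbers whose bit is set in the bitmap."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


values = [5, 0, 7, 3, 6, 2]
slices = bit_slices(values, width=3)
print(rows_of(greater_than(slices, 3)))   # rows holding 5, 7 and 6: [0, 2, 4]
```

Only the slices from position k upward are touched, so a boundary value with many trailing 1-bits needs correspondingly fewer logical operations.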



3 The EIN-Ring Based Density Clustering Algorithm 

In this section, we present an EIN-ring based Density Clustering approach (EDC). We
first define neighborhood rings and the equal interval neighborhood ring (EIN-ring), and
then describe how EIN-rings are calculated using P-trees. In section 3.2, we
describe the calculation of the density function using EIN-rings. In section 3.3, the
algorithm for finding density attractors is developed. Finally, the efficiency of our
algorithm is analyzed in terms of time complexity.

3.1 EIN-Ring Based Neighborhood Search

Definition 1. The Neighborhood Ring of data point c with radii $r_1$ and $r_2$ is defined as
the set $R(c, r_1, r_2) = \{x \in X \mid r_1 < |c-x| \le r_2\}$, where $|c-x|$ is the distance
between x and c.

Definition 2. The Equal Interval Neighborhood Ring of data point c with radius r
and fixed interval $\lambda$ is defined as the neighborhood ring
$R(c, r, r+\lambda) = \{x \in X \mid r < |c-x| \le r+\lambda\}$, where $|c-x|$ is the distance
between x and c.

The interval $\lambda$ is a user-defined parameter based on accuracy requirements. The
higher the accuracy requirement, the smaller the interval. For $r = k\lambda$, $k = 1, 2, \ldots$,
the rings are called the k-th EIN-rings. Fig. 2 shows 2-D EIN-rings with k = 1, 2, and 3.




Fig. 2. Diagram of EIN-rings. 

The calculation of neighbors within the EIN-ring $R(x, r, r+\lambda)$ is as follows. Let
$P_{r,\lambda}$ be the P-tree representing the data points within the EIN-ring
$R(x, r, r+\lambda)$. We note that $P_{r,\lambda}$ is just the predicate tree corresponding to
the predicate $x-r-\lambda < X \le x-r$ or $x+r < X \le x+r+\lambda$. We first calculate the
data points within the neighborhood rings $R(x, 0, r)$ and $R(x, 0, r+\lambda)$ by
$P_{x-r<X \le x+r}$ and $P_{x-r-\lambda<X \le x+r+\lambda}$ respectively.
$P_{x-r-\lambda<X \le x+r+\lambda}$ is shown as the shadow area of a) and
$P_{x-r<X \le x+r}'$ is the shadow area of b) in Fig. 3. The data points within the EIN-ring
$R(x, r, r+\lambda)$ are those that are in $R(x, 0, r+\lambda)$ but not in $R(x, 0, r)$.
Therefore $P_{r,\lambda}$ is calculated by the following formula, which is the shadow area
shown in c) of Fig. 3.

$P_{r,\lambda} = P_{x-r-\lambda<X \le x+r+\lambda} \wedge P_{x-r<X \le x+r}'$   (3)



3.2 Calculation of the Density Function Using EIN-Ring 

A density-based clustering algorithm is a clustering method based on a set of density
distribution functions, called influence functions, each of which describes the impact of
a data point within its neighborhood. Our algorithm employs a special EIN-ring based
influence function. The overall density of the data space is then modeled as the sum of the
influence functions of all data points. Clusters are determined by identifying density
attractors, where density attractors are local maxima of the overall density function.

Fig. 3. Calculation of Data Points within EIN-ring $R(x, r, r+\lambda)$:
a) $P_{x-r-\lambda<X \le x+r+\lambda}$, b) $P_{x-r<X \le x+r}'$,
c) $P_{x-r-\lambda<X \le x+r+\lambda} \wedge P_{x-r<X \le x+r}'$.

Let x and y be data points in $F^d$, a d-dimensional feature space. The influence
function of the data point y on x is a function $f^y: F^d \to R_0^+$, which is defined based
on the EIN-ring:

$f^y_{r_1,r_2}(x) = 1$ if $y \in R(x, r_1, r_2)$
$f^y_{r_1,r_2}(x) = 0$ if $y \notin R(x, r_1, r_2)$.   (4)



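The EIN-ring based density function of x is defined as the weighted summation of
RC(x, r), the root count of the P-tree of the r-th EIN-ring around x, i.e., the number of
data points in that ring. It is calculated as follows:

$f^D(x) = \sum_{r=1}^{m} w_r \cdot f_{r_1,r_2}(x) = \sum_{r=1}^{m} w_r \cdot RC(x, r)$   (5)

where $f^D(x)$ denotes the EIN-ring based density of data point x with respect to the
weights $w_r$, and $f_{r_1,r_2}(x)$ is the influence on x summed over all data points of the
r-th EIN-ring. The weight is selected by an RBF kernel function of the radius of the
EIN-ring, such as a Gaussian function, a step function, etc.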
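
As a small illustration of the ring count and the weighted sum, the sketch below computes both by brute force over a point list instead of with P-trees. It assumes Euclidean distance, a Gaussian weight on the ring radius, and rings starting at radius $\lambda$ (k = 1, 2, ...); the function names and the `sigma` parameter are choices made for this example only.

```python
import math


def count_within(points, x, radius):
    """Number of points p with 0 < |x - p| <= radius, i.e. |R(x, 0, radius)|."""
    return sum(1 for p in points if 0 < math.dist(x, p) <= radius)


def ring_count(points, x, r, lam):
    """RC for R(x, r, r + lam): inside the outer ball but not the inner one."""
    return count_within(points, x, r + lam) - count_within(points, x, r)


def density(points, x, lam, m, sigma=1.0):
    """Weighted sum of the ring counts of the first m EIN-rings around x."""
    total = 0.0
    for k in range(1, m + 1):
        r = k * lam
        weight = math.exp(-(r * r) / (2.0 * sigma * sigma))  # assumed Gaussian kernel
        total += weight * ring_count(points, x, r, lam)
    return total


points = [(0.0, 0.0), (1.5, 0.0), (0.0, 2.5), (3.0, 4.0)]
center = (0.0, 0.0)
first_ring = ring_count(points, center, 1.0, 1.0)    # points with 1 < distance <= 2
score = density(points, center, lam=1.0, m=2)
```

Subtracting the inner count from the outer count mirrors the P-tree formula, where the outer predicate tree is ANDed with the complement of the inner one.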



3.3 Finding Density Attractors Using the Look Around Pruning Technique 

Once the density of each data point is defined, the next step is to define density attrac- 
tors, i.e., local maxima of the overall density function. Having a high density doesn’t 
necessarily make a point a density attractor - it must have the highest density among 
its neighbors. Instead of using formal hill climbing as is done in DENCLUE [3], we 
adopt a simpler heuristic look around technique. 

Algorithm 1. Look Around Pruning. We first define a neighborhood as a ball of
some chosen radius r. The number r can range from 0 to the maximal bit length of the
attributes. After finding the density function of a point, x, we compare that density
with that of data points within its neighborhood. If it is greater than the density of all
its neighbors, it is labeled as a new density attractor. Any old density attractor in that
neighborhood is de-labeled as a density attractor.

After all the data points have gone through the process above, we have a set of in- 
termediate density attractors. We compare each intermediate attractor’s density with 
that of its nearest neighbor data point. If the former is less than the latter, the attractor 
is de-labeled. Otherwise, it is a final density attractor. This step finds attractors that 
are isolated and therefore should be removed as noise. 
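
The two passes can be sketched as follows. This is a brute-force illustration, not the P-tree based procedure of the paper: densities are assumed to be precomputed, the neighborhood is taken as a Euclidean ball of radius `radius`, and the function name `look_around` is hypothetical.

```python
import math


def look_around(points, dens, radius):
    """Return the indices of the final density attractors."""
    n = len(points)
    flag = [False] * n

    def neighbors(i):
        return [j for j in range(n)
                if j != i and math.dist(points[i], points[j]) <= radius]

    # Pass 1: label a point if its density exceeds that of all its neighbors,
    # and de-label any attractor previously found in that neighborhood.
    for i in range(n):
        nbrs = neighbors(i)
        if all(dens[i] > dens[j] for j in nbrs):
            flag[i] = True
            for j in nbrs:
                flag[j] = False

    # Pass 2: compare each intermediate attractor with its nearest neighbor
    # and drop it if that neighbor is denser (isolated, noisy points).
    for i in range(n):
        if flag[i] and n > 1:
            nearest = min((j for j in range(n) if j != i),
                          key=lambda j: math.dist(points[i], points[j]))
            if dens[i] < dens[nearest]:
                flag[i] = False

    return [i for i in range(n) if flag[i]]


points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (30.0,)]
dens = [3, 5, 4, 2, 6, 1]
attractors = look_around(points, dens, radius=1.5)   # [1, 4]
```

In this example the point at 30 has no neighbor within the radius, so the first pass labels it; the second pass removes it because its nearest neighbor is denser.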

Definition 3. Density Attractor Set. Given a sequence of points $x_1, x_2, \ldots, x_n$, the
Density Attractor Set $DAS(x_1, x_2, \ldots, x_n)$ is the set of attractors produced by the look
around algorithm applied to the data points in the order $x_1, x_2, \ldots, x_n$.

Definition 4. A data point x is reachable from data point y if $x \in R(y, 0, r)$, where r
is the user-defined radius for the density clustering. If x is reachable from y, y is also
reachable from x.

The look around pruning algorithm is robust, which means the clustering results
are independent of the order in which the data points are treated. The proof is given as
follows.

Lemma 2. (Density Characterization Lemma) Data point y is a density attractor
iff $D_y > D_z$, $\forall z \in R(y, 0, r)$. If y is not a density attractor, $\exists z \in R(y, 0, r)$
such that $D_z > D_y$.

Lemma 3. Given a data point y with $D_y > D_z$, $\forall z \in R(y, 0, r)$, y is a density
attractor independent of the order in which y and z are treated in the look around process.

Proof (Proof by contradiction):

Assume the statement is not true, i.e. y is an attractor, but $\exists z \in R(y, 0, r)$ such
that $D_z > D_y$. If z is treated first and z is an attractor, then when y gets treated, y would
not be an attractor (Lemma 2). If y is treated before z, y could be designated an
attractor at that time. But when z gets treated, y will be de-labeled according to the look
around pruning algorithm (Algorithm 1). Therefore y is not an attractor. Contradiction!

Theorem 2. Given data set X in two different sequences, $\{x_{i1}, x_{i2}, \ldots, x_{in}\}$ and
$\{x_{j1}, x_{j2}, \ldots, x_{jn}\}$, then
$DAS(x_{i1}, x_{i2}, \ldots, x_{in}) = DAS(x_{j1}, x_{j2}, \ldots, x_{jn})$.

Proof (Proof by contradiction):

Assume the statement is not true, i.e.
$DAS(x_{i1}, x_{i2}, \ldots, x_{in}) \ne DAS(x_{j1}, x_{j2}, \ldots, x_{jn})$. That means
$\exists x \in DAS(x_{i1}, x_{i2}, \ldots, x_{in})$ but $x \notin DAS(x_{j1}, x_{j2}, \ldots, x_{jn})$.
According to $x \in DAS(x_{i1}, x_{i2}, \ldots, x_{in})$ and Lemma 2, $\forall z \in R(x, 0, r)$,
$D_x > D_z$. Also, according to $x \notin DAS(x_{j1}, x_{j2}, \ldots, x_{jn})$ and Lemma 2,
$\exists z \in R(x, 0, r)$ such that $D_z > D_x$. Contradiction!

We illustrate the finding of density attractors using the look around pruning algorithm
as follows. Suppose the Qid of data point x is 0.3.2 and $D_x = 250$. We need to compare
$D_x$ with the densities of its neighbors. From the neighborhood P-tree of x, x has four
neighbors with Qids of 0.0.2, 0.3.1, 2.3.0 and 2.3.3. Suppose the densities of these points
are respectively 300, 0, 220 and 0, and 0.0.2 and 2.3.0 are labeled as density attractors.
Comparing $D_x$ with the maximal density of 0.0.2 and 2.3.0, 250 < max(300, 220), so we
determine that x is not a density attractor. If instead $D_x = 350$, then
350 > max(300, 220), and x is labeled as the new density attractor. The old density
attractors 0.0.2 and 2.3.0 are de-labeled and will not be considered later. Finally, clusters
are determined by the density attractors. The pseudo code of the overall algorithm is shown
in Fig. 4.

INPUT: P-tree set P_{i,j} for attribute i and bit j, HOBBit ring R(i, 0, σ)
OUTPUT: Density attractors

// P_{i,j} - P-tree for attribute i and bit j;
// PN - Neighborhood P-tree;
// N - # of data points; n - # of attributes;
// flag[i] - label array of cluster center of data point i.

BEGIN
FOR i = 1 TO N DO
    flag[i] <- 0
    PN <- Pure1 P-tree, DENS[i] <- 0, PrevRC <- 0
    FOR h = 1 TO m - 1 DO
        P <- Pure1 P-tree
        FOR j = 1 TO n DO
            GET b_{j,h}[i]
            IF b_{j,h}[i] = 1
                P <- P & P_{j,h}
            ELSE
                P <- P & P'_{j,h}
            P[h] <- P
        END FOR
        PN <- PN & P[h]
        w[i] = h * 2
        DENS[i] <- DENS[i] + w[i] * (RootCount(PN) - PrevRC);
        PrevRC <- RootCount(PN);
        IF h = m - σ
            P_σ <- PN
    END FOR
    IF DENS[i] > the density of attractors within neighborhood,
        flag[i] <- 1, clear the flags of its neighbors.
END FOR

// Final look around pruning to intermediate attractors
FOR i = 1 TO N DO
    IF flag[i] = 1 AND DENS[i] < the density of the closest neighbor
        Clear its flag
END FOR

Fig. 4. EIN-ring based density clustering algorithm



3.4 Time Complexity Analysis 

Let f be the fan-out of a P-tree and let n be the number of data points it represents.
We first present some lemmas on P-trees, and then derive the average run time
complexity to be $O(n\sqrt{n})$.

Lemma 4. The number of levels of a P-tree is $k = \log_f n$.

Proof: The numbers of nodes in each level of a P-tree are $1, f, f^2, f^3, \ldots, f^k$.
Obviously the leaf level k is n bits long, i.e. $f^k = n$. Thus $k = \log_f n$.

Lemma 5. The maximum number of nodes in a P-tree in the worst case is
$\eta = (n - 1)/(f - 1)$.

Proof: Without compression, the total number of nodes is
$\eta = 1 + f + f^2 + \ldots + f^{k-1} = (f^k - 1)/(f - 1)$. According to Lemma 4, $f^k = n$,
so we get $\eta = (n - 1)/(f - 1)$.

Lemma 6. The total number of nodes in a P-tree with a compression ratio of $\rho$
($\rho < 1$) is $\eta = 1 + (\rho^k * n - f)/(f * \rho - 1)$, where k is the number of levels of
the P-tree.

Proof: The number of nodes at level i of a P-tree with compression ratio $\rho$ is
$f^i * \rho^{i-1}$, where i ranges from 1 to k. For example, at level 2, there are
$(f * \rho) * f = f^2 * \rho$ nodes. We get the total number of nodes in the case that the
P-tree has a compression ratio of $\rho$ as

$\eta = 1 + f + f^2 * \rho + f^3 * \rho^2 + \ldots + f^k * \rho^{k-1}$

$= 1 + f * ((f * \rho)^k - 1)/(f * \rho - 1)$

$= 1 + (f^k * \rho^k - f)/(f * \rho - 1)$

$= 1 + (\rho^k * n - f)/(f * \rho - 1)$

Corollary 1. When $\rho = 0$, the total number of nodes in the P-tree is 1; when $\rho = 1$,
the total number of nodes in the P-tree is $(n - f)/(f - 1) + 1$. When $\rho = 0.5$ and
$f = 4$, the total number of nodes in a P-tree with compression ratio $\rho$ is

$\eta = 1 + (4^k/2^k - 2 * 4)/(4 - 2)$

$= 1 + (4^{k/2} - 8)/2$

$= 1 + (\sqrt{n} - 8)/2$
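
The square-root growth for $\rho = 0.5$ and $f = 4$ can also be checked numerically by summing the per-level node counts $f^i * \rho^{i-1}$ from the proof of Lemma 6 directly. The sketch below is only such a check; the function name `node_count` is hypothetical, and the series is evaluated term by term rather than through a closed form.

```python
import math


def node_count(k, fan_out=4, rho=0.5):
    """One root plus fan_out**i * rho**(i - 1) nodes on each level i = 1..k."""
    return 1 + sum(fan_out ** i * rho ** (i - 1) for i in range(1, k + 1))


for k in (4, 6, 8, 10):
    n = 4 ** k                              # leaf level holds n = f**k bits
    ratio = node_count(k) / math.sqrt(n)    # stays bounded as n grows
    print(k, n, node_count(k), round(ratio, 3))
```

The ratio of the node count to $\sqrt{n}$ stays below a small constant for every k tried, which is the property the complexity argument relies on.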



Theorem 3. The average run time complexity of EDC with compression ratio 0.5
and fan-out 4 is $O(d * n * \sqrt{n})$, where d is the number of dimensions.

Proof: The P-tree ANDing operation is executed node by node when calculating the
density. Each node ANDing is counted as one operation. For n data points in d
dimensions, there are d*m basic P-trees, where m is the maximal bit size of each
dimension. The total run time to get the density P-trees is $d * m * n * \eta$, where $\eta$ is
the total number of nodes of a P-tree.

For data sets with fan-out $f = 4$ and average compression ratio $\rho = 0.5$, according to
Corollary 1, the total number of nodes of a P-tree is $\eta = 1 + (\sqrt{n} - 8)/2$. Therefore,
the total time to get the density for n data points in d dimensions is
$d * m * n * (1 + (\sqrt{n} - 8)/2)$. Thus, the average time complexity of density based
clustering using P-trees with compression ratio 0.5 and fan-out of 4 is
$O(d * n * \sqrt{n})$.

4 Experiment Evaluation

Our experiments were implemented in the C++ language on a 1 GHz Pentium PC
with 1 GB main memory, running Debian Linux 4.0. The test data includes
an aerial TIFF image (with Red, Green and Blue band reflectance values) and
moisture and nitrate maps of the Oakes Irrigation Test Area in North Dakota. The data
is prepared in five sizes, that is, 128x128, 128x256, 256x256, 256x512, and 512x512. The
data sets are available at [4]. We evaluate our proposed EIN-ring based density
clustering algorithm (EDC) with respect to scalability, which is tested by increasing the
number of data records and the number of attributes.

In this experiment, we compare our proposed EDC with the Density Function based
Clustering method using Euclidean distance (DFC). The experiment was performed on
the five different sizes of data sets. The average CPU run time of 30 runs is shown in
Fig. 5.

Fig. 5. Running Time Comparison of EDC with other Density Clustering (legend: DFC, EDC)

From Fig. 5, we see that the EDC method is much faster than DFC on these five
data sets. In particular, as the data set size increases, the run time of the EDC method
increases at a much lower rate. The experimental results show that the EDC method is
fast and scalable for large spatial data sets.



5 Conclusion 

In this paper, a unique approach to efficient neighborhood search using EIN-rings
and a new efficient density-based clustering algorithm are developed. Our approach
exploits compressed vertical data structures, Peano Trees (P-trees), and fast P-tree
logical operations to accelerate the calculation of the density function within EIN-rings.
This approach stands in contrast to the ubiquitous approach of vertically scanning
horizontal data structures (records). The overall run time complexity is
$O(d\,n\sqrt{n})$ for a d-dimensional data set, on average. Experimental results show that
the algorithm works efficiently on large-scale, high-dimensional spatial data,
outperforming other density methods significantly.

Our method is particularly useful for data streams. In data streams, such as large
sets of transactions, remotely sensed images, multimedia video, etc., new data keeps
arriving continually. Therefore both speed and accuracy are critical issues. Achieving
high speed using P-trees and high accuracy using the weighted EIN-rings provides
a density-based clustering method that is well suited to the clustering of stream data.
Besides spatial data, our method also has potential applications in other areas, such as
DNA microarray and medical image analysis.

Abstract. This paper proposes a grid-based clustering method that dynamically
partitions the range of a grid-cell based on its distribution statistics of data ele- 
ments in a data stream. Initially the multi-dimensional space of a data domain is 
partitioned into a set of mutually exclusive equal-size initial cells. As a new 
data element is generated continuously, each cell monitors the distribution sta- 
tistics of data elements within its range. When the support of data elements in a 
cell becomes high enough, the cell is dynamically divided into two mutually 
exclusive smaller cells called intermediate cells by assuming the distribution of 
data elements is a normal distribution. Eventually, the dense sub-range of an 
initial cell is recursively partitioned until it becomes the smallest cell called a 
unit cell. In order to minimize the number of cells, a sparse intermediate or unit 
cell can be pruned if its support becomes much less than a minimum support. 
The performance of the proposed method is comparatively analyzed through a 
series of experiments. 



1 Introduction 

Recently, several data mining methods [1,2,3] for data streams have been actively proposed.
A data stream is a massive unbounded sequence of data elements continuously gener- 
ated at a rapid rate. Due to this reason, it is impossible to maintain all elements of a 
data stream. Consequently, data stream processing should satisfy the following re- 
quirements [4]. First, each data element should be examined at most once to analyze a 
data stream. Second, memory usage for data stream analysis should be restricted 
finitely although new data elements are continuously generated in a data stream. 
Third, newly generated data elements should be processed as fast as possible to pro- 
duce the up-to-date analysis result of a data stream, so that it can be instantly utilized 
upon request. To satisfy these requirements, data stream processing sacrifices the 
correctness of its analysis result by allowing some errors. 

This paper proposes a grid-based clustering method that dynamically partitions the
range of a grid-cell based on the distribution statistics of its data elements in a data
stream. Initially the multi-dimensional space of a data domain is partitioned into a set
of mutually exclusive equal-size initial cells. As new data elements are generated
continuously, each cell monitors the distribution statistics of the data elements within its
range. When the support of a cell becomes high enough, the cell is dynamically divided
into two mutually exclusive smaller cells, called intermediate cells, based on its
distribution statistics. Similarly, a dense intermediate cell can itself be partitioned, in
which case it is replaced by its two divided cells. Eventually, the dense sub-range of an
initial cell is recursively partitioned until it becomes the smallest cell, called a unit cell. A
cluster of a data stream is a group of adjacent dense unit cells. The smaller the size of a
unit cell is set, the more accurately the resulting set of clusters is identified. In
order to minimize the number of cells, a sparse intermediate or unit cell is pruned if
its support becomes much less than a minimum support.

The rest of this paper is organized as follows. Section 2 presents related works. 
Section 3 presents the proposed statistical σ-partition clustering algorithm in detail. In
Section 4, several experimental results are comparatively analyzed to illustrate the 
various characteristics of the proposed method. Finally, Section 5 presents conclu- 
sions. 

2 Related Works 

Clustering is a process of finding groups of similar data elements which are defined 
by a given similarity measure. Clustering techniques are categorized into several 
methods: partitioning, hierarchical, density-based and grid-based. The partitioning 
method such as k-means[5] and k-medoid[6] divides the data space of a data set into 
k mutually disjoint regions called clusters. The number of clusters should be prede- 
fined in advance. The k-medoid algorithm selects k data elements as the centers of k 
clusters initially, and repeatedly replaces one of the selected centers until it finds the 
best set of k centers. In this method, noise data elements can substantially influence 
the generation of a cluster, so that it may be difficult to produce a correct result in 
some cases. The hierarchical method such as BIRCH[7] and CURE[8] decomposes a 
data set into a tree-like structure. In BIRCH, a CF(Clustering Feature) tree which is 
used to summarize cluster representations is generated dynamically. After the CF tree 
is built, any clustering algorithm such as a typical partitioning algorithm is then used. 
In CURE, instead of using a single centroid to represent a cluster, a fixed number of 
well-scattered data objects is selected to represent a cluster. The selected representa- 
tive data objects are shrunk towards the centroid of their cluster by a specified shrink- 
ing factor in the process of clustering. Among the clusters, two adjacent clusters 
whose representative data objects are the closest can be merged into one cluster until 
a predefined number of clusters is left. A typical density-based clustering
algorithm [9] regards a cluster as a region in a data space with a high density of data
elements. Its strong points are that it can discover an arbitrarily shaped cluster and
handle noise data easily. In the grid-based clustering method, the data space of a
problem is divided statistically into a set of equal-size cells. A cluster is generated by
merging adjacent cells that have more than a predefined number of data elements. The
method is very efficient in time, but the accuracy of a cluster is affected by the size
of a cell. STING[10] uses a grid-based multi-resolution data structure in which a data
space is divided into rectangular cells. There are several levels of such rectangular
cells corresponding to the different levels of resolution.

Most conventional clustering algorithms assume a data set is fixed and focus on
how to minimize processing time or memory usage algorithmically. When a data set
is enlarged incrementally, it is more efficient to use incremental clustering
algorithms [7, 11], which mainly focus on how to utilize the previous clustering result of an
original data set in clustering its enlarged data set efficiently. In other words, the set
of old data elements is scanned only when a new possible cluster may be found by the
set of newly added data elements. Therefore, all the old data elements should be
maintained physically.

In [13], a k-median algorithm is proposed to find the clusters of data elements
generated in a data stream. It regards a data stream as a sequence of stream chunks. A
stream chunk is a set of consecutive data elements generated in a data stream.
Whenever a new stream chunk containing a set of newly generated data elements is formed,
the LSEARCH routine, which is an O(1)-approximate k-medoid algorithm, is
performed to select k data elements from the data elements of the stream chunk as the
local centers of the chunk. The algorithm confines its memory space to holding a
fixed number of local centers for previous stream chunks. Therefore, if retaining $ik$
centers is impossible at the i-th stream chunk, the LSEARCH routine is performed
again to cluster the weighted $ik$ points to retain k centers.



3 σ-Partition Clustering

Given a data stream D of a d-dimensional data space $N = N_1 \times N_2 \times \ldots \times N_d$,
a data element generated at the j-th turn is denoted by
$e^j = \langle e^j_1, e^j_2, \ldots, e^j_d \rangle$, $e^j_i \in N_i$, $1 \le i \le d$. When a
new data element $e^t$ is generated at the t-th turn in a data stream D, the current data
stream is composed of all the data elements that have been generated so far,
i.e. $D^t = \{e^1, e^2, \ldots, e^t\}$. The total number of data elements generated in the current
data stream $D^t$ is denoted by $|D^t|$.

Finding a cluster of similar data elements in the current data stream $D^t$ means
identifying a region whose current density of data elements is high enough. A unit cell
whose length in each dimension is less than $\lambda$ is used to define the similarity between
data elements. The current support of a cell is the ratio of the number of those data
elements in $D^t$ that are inside the cell over the total number of data elements in $D^t$.
Therefore, a cluster at $D^t$ is a group of adjacent dense unit cells whose current
supports are greater than or equal to a predefined minimum support.

The range of each dimension $N_i$ is initially partitioned into p mutually
exclusive equal-size intervals $I_i^j = [s_i^j, f_i^j)$, $1 \le j \le p$, where $s_i^j$ and
$f_i^j$ denote the start and end values of the j-th interval of the i-th dimension.
Consequently, $p^d$ initial cells are formed in N and each initial cell g is defined by a set
of d intervals $[I_1, I_2, \ldots, I_d]$, $I_i \subset N_i$, $1 \le i \le d$. The range R(g) of an
initial cell g is a rectangular space $rs = I_1 \times \ldots \times I_d$.
However, the initial rectangular space of an initial cell becomes a set of rectangular
spaces $RS = \{rs_1, rs_2, \ldots, rs_q\}$ as a series of cell partitioning and pruning operations
are performed subsequently. When these rectangular spaces are projected to the i-th
dimension, the intervals of the i-th dimension of a cell g can be found. The sum of these
intervals is defined as the interval size of the i-th dimension of the cell g. The range of the
cell g is the union of all the rectangular spaces, $R(g) = \bigcup_{i=1}^{q} rs_i$. Each cell keeps
the current distribution statistics of those data elements in the current data stream $D^t$
that are within its range, as defined in Definition 1.



[Definition 1] Distribution Statistics of a grid-cell $g(RS, c^t, \mu^t, \sigma^t)$

For the current data stream $D^t$, a term $g(RS, c^t, \mu^t, \sigma^t)$ is used to denote the
distribution statistics of a cell g which is defined by a set of its rectangular spaces RS. Let
$D^t_g$ denote those elements in $D^t$ that are in the range of the cell g, i.e.
$D^t_g = \{e \mid e \in D^t$ and $e \in R(g)\}$. The distribution statistics of the cell g are
defined as follows:

i) $c^t$ : the number of data elements in $D^t_g$

ii) $\mu^t = \langle \mu^t_1, \ldots, \mu^t_d \rangle$ : $\mu^t_i$ denotes the average of the i-th
dimensional values of the data elements in $D^t_g$

$\mu^t_i = \sum_{j=1}^{c^t} e^j_i / c^t$, $1 \le i \le d$

iii) $\sigma^t = \langle \sigma^t_1, \ldots, \sigma^t_d \rangle$ : $\sigma^t_i$ denotes the standard
deviation of the i-th dimensional values of the data elements in $D^t_g$

$\sigma^t_i = \sqrt{\sum_{j=1}^{c^t} (e^j_i - \mu^t_i)^2 / c^t}$, $1 \le i \le d$



When a new data element $e^t$ is generated in the current data stream $D^t$, its
corresponding initial cell among the $p^d$ initial cells is identified based on the initial
partitions of the data space N. If the data element is in the range of the initial cell g and the
distribution statistics of the cell g were updated most recently at the insertion of the v-th
data element ($v < t$), its statistics have remained the same as
$g(RS, c^v, \mu^v, \sigma^v)$ and they are updated to $g(RS, c^t, \mu^t, \sigma^t)$ as follows:
for $\forall i$, $1 \le i \le d$

$c^t = c^v + 1$,

$\mu^t_i = \frac{\mu^v_i \times c^v + e^t_i}{c^t}$,

$\sigma^t_i = \sqrt{\frac{c^v}{c^t} \times \left((\sigma^v_i)^2 + (\mu^v_i)^2\right) + \frac{(e^t_i)^2}{c^t} - (\mu^t_i)^2}$
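
A per-cell update of this kind can be sketched as below. The sketch uses the standard one-pass identity for the mean and the population standard deviation (keep the second moment, then subtract the squared new mean); it is an illustration with hypothetical names, not the authors' code, and the cell is reduced to the three statistics c, mu and sigma.

```python
import math


def update(c, mu, sigma, e):
    """Fold one new element e into the statistics (c, mu, sigma) of a cell."""
    c_new = c + 1
    mu_new = [(m * c + x) / c_new for m, x in zip(mu, e)]
    sigma_new = []
    for m_old, s_old, m_new, x in zip(mu, sigma, mu_new, e):
        # second moment of the old elements plus the new value, averaged
        second_moment = (c * (s_old ** 2 + m_old ** 2) + x ** 2) / c_new
        # clamp tiny negative values caused by floating-point rounding
        sigma_new.append(math.sqrt(max(second_moment - m_new ** 2, 0.0)))
    return c_new, mu_new, sigma_new


data = [(1.0, 10.0), (2.0, 14.0), (4.0, 9.0), (7.0, 11.0)]
c, mu, sigma = 0, [0.0, 0.0], [0.0, 0.0]
for element in data:
    c, mu, sigma = update(c, mu, sigma, element)
```

Each element is examined once and only three values per dimension are stored, which matches the one-pass and bounded-memory requirements stated in the introduction.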



For the current data stream $D^t$, the current support of an initial cell
$g(RS, c^t, \mu^t, \sigma^t)$ is defined by the ratio of its count over the total number of data
elements generated so far, i.e. $c^t/|D^t|$. When the current support of the cell reaches a
predefined split support, two intermediate cells $g_1$ and $g_2$ are created as the children
of the initial cell. To split the range of the cell g, a dividing dimension is selected
based on the distribution deviation of data elements in the cell g. Among the dimensions
whose interval sizes for the cell g are larger than $\lambda$, the one with the smallest
standard deviation, say $\sigma^t_k$, is chosen as the dividing dimension. Based on the standard
deviation $\sigma^t_k$ in the dividing dimension, the set of intervals in the dividing dimension
k is partitioned into two sets of intervals. One contains those intervals that are within
the interval $[\mu^t_k - \sigma^t_k, \mu^t_k + \sigma^t_k]$, in which 68 percent of the data elements
in the cell g are assumed to be distributed according to a normal distribution. The other
includes the remaining intervals. The rectangular spaces of the cell g are divided into two
mutually exclusive sets by a statistical σ-partition method with respect to the two sets of
intervals in the dividing dimension. These two sets of rectangular spaces are
assigned to the ranges of the two divided cells $g_1$ and $g_2$ respectively. If a rectangular
space of the cell g includes a boundary $\mu^t_k \pm \sigma^t_k$, the corresponding interval of the
dividing dimension is actually divided. Figure 1 illustrates how to divide the rectangular
spaces of a cell $g(RS, c, \mu, \sigma)$ in a two-dimensional data space.



[Figure: a cell $g(RS, c, \mu, \sigma)$ with $RS = \{rs_1, rs_2, rs_3\}$ is divided into
$g_1(RS_1, c_1, \mu_1, \sigma_1)$ and $g_2(RS_2, c_2, \mu_2, \sigma_2)$]

Fig. 1. $\sigma$-partition on a cell g



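When a cell $g(RS, c^t, \mu^t, \sigma^t)$ is partitioned by the above $\sigma$-partition method into
cells $g_1(RS_1, c_1^t, \mu_1^t, \sigma_1^t)$ and $g_2(RS_2, c_2^t, \mu_2^t, \sigma_2^t)$, the distribution statistics of $g_1$
and $g_2$ are initialized as follows. Let

$$\varphi(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k^t} \exp\left(-\frac{(x - \mu_k^t)^2}{2(\sigma_k^t)^2}\right)$$

be the normal distribution function of the data elements in the dividing dimension k
for the cell g.

$$c_1^t = c^t \times \int_{\mu_k^t - \sigma_k^t}^{\mu_k^t + \sigma_k^t} \varphi(x)\,dx, \qquad c_2^t = c^t - c_1^t \qquad (1)$$

$\mu_{l,i}^t = \mu_i^t$ for $l = 1, 2$, except $\mu_{1,k}^t$ and $\mu_{2,k}^t$:

$$\mu_{1,k}^t = \int_{s_k(g_1)}^{f_k(g_1)} x\,\varphi(x)\,dx$$

$$\text{if } s_k(g_2) < \mu_{1,k}^t < f_k(g_2): \quad \mu_{2,k}^t = \int_{s_k(g_2)}^{f_k(g_2)} x\,\varphi(x)\,dx - \mu_{1,k}^t$$

$$\text{else} \quad \mu_{2,k}^t = \int_{s_k(g_2)}^{f_k(g_2)} x\,\varphi(x)\,dx$$

where $s_k(g_l)$ and $f_k(g_l)$ denote the smallest start and largest end value of the intervals of
the k-th dividing dimension for the divided cell $g_l$, l=1,2. At the same time, the
distribution statistics of the original cell $g(RS, c^t, \mu^t, \sigma^t)$ are reset as $c^t=0$ and $\mu_i^t=\sigma_i^t=0$ for $\forall i$,
$1 \le i \le d$ since they are carried over to those of $g_1$ and $g_2$.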
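The count initialization of Equation (1) can be checked numerically. The sketch below assumes a normal distribution in the dividing dimension and evaluates the integral through the error function; `normal_cdf`, `split_counts` and the sample values are hypothetical names and numbers chosen for illustration.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def split_counts(count, mu_k, sigma_k):
    """Estimate the counts of the two divided cells as in Equation (1)."""
    # Probability mass inside [mu_k - sigma_k, mu_k + sigma_k]
    inner = normal_cdf(mu_k + sigma_k, mu_k, sigma_k) - normal_cdf(mu_k - sigma_k, mu_k, sigma_k)
    c1 = count * inner
    return c1, count - c1

c1, c2 = split_counts(1000, mu_k=50.0, sigma_k=5.0)
print(round(c1, 2), round(c2, 2))
```

The inner fraction does not depend on the particular mean or standard deviation, which is why the split is always roughly 68% against 32%.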

When a newly generated data element is not in the range of its corresponding initial
cell, the children of the initial cell are searched to find the one whose range
includes the element. After the target intermediate cell g is found, its distribution
statistics are updated in the same way as for an initial cell. When the updated support of the
intermediate cell itself reaches $S_{spl}$ and the range of the cell is larger than
that of a unit cell, the intermediate cell g is divided into two smaller intermediate cells
in the same way as an initial cell is divided. As in an initial cell, among the dimensions
whose interval sizes are larger than $\lambda$, the one with the smallest standard deviation is
chosen as a dividing dimension. However, unlike an initial cell, the original
intermediate cell is replaced by the two divided cells. Consequently, the parent initial cell of
the original cell becomes the parent of each divided cell.

On the other hand, when the current support of an intermediate cell g becomes less
than a predefined pruning support $S_{prn}$, i.e., $c^t/|D^t| < S_{prn}$, the probability of finding a
cluster in the range of the cell in the near future is very low. Consequently, the cell is
removed and its distribution statistics $g(RS, c^t, \mu^t, \sigma^t)$ are returned to its parent
initial cell $g_p$. Suppose the distribution statistics of the parent cell $g_p$ were last
updated at the v-th element (v<t) and they are denoted by $g_p(RS_p, c_p^v, \mu_p^v, \sigma_p^v)$ where
$\mu_p^v = \langle \mu_{p,1}^v, \ldots, \mu_{p,d}^v \rangle$ and $\sigma_p^v = \langle \sigma_{p,1}^v, \ldots, \sigma_{p,d}^v \rangle$. Its new statistics
$g_p(RS_p, c_p^t, \mu_p^t, \sigma_p^t)$ at $D^t$ are updated as follows:

For all dimensions i ($1 \le i \le d$), $c_p^t = c_p^v + c^t$, and $\mu_{p,i}^t$ and $\sigma_{p,i}^t$ are recomputed from
the combined statistics of the parent cell and the removed cell.
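One way to return the statistics of a removed cell to its parent is the standard pooled combination of two (count, mean, standard deviation) triples. The following sketch is an assumption: it is consistent with the population statistics defined earlier, but it is not necessarily the exact update used by the authors, and `merge_stats` is a hypothetical name.

```python
import math

def merge_stats(c_a, mu_a, sd_a, c_b, mu_b, sd_b):
    """Combine the per-dimension statistics of two disjoint groups of elements."""
    c = c_a + c_b
    if c == 0:
        return 0, 0.0, 0.0
    mu = (c_a * mu_a + c_b * mu_b) / c
    # Pool the mean of squares of both groups, then subtract the squared pooled mean.
    mean_of_squares = (c_a * (sd_a ** 2 + mu_a ** 2) + c_b * (sd_b ** 2 + mu_b ** 2)) / c
    return c, mu, math.sqrt(max(mean_of_squares - mu ** 2, 0.0))

# Parent holds the summary of [2, 4, 4, 4]; the pruned child holds that of [5, 5, 7, 9].
c_p, mu_p, sd_p = merge_stats(4, 3.5, math.sqrt(0.75), 4, 6.5, math.sqrt(2.75))
print(c_p, mu_p, sd_p)
```

The merged triple equals the statistics of the eight underlying values taken together, so no information about count, mean or spread is lost by pruning.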




When a cell is divided, 68% of its count is assigned to one divided cell and the
rest, i.e., 32%, is assigned to the other by Equation (1). Therefore, the value of a
pruning support should be less than 32% of $S_{spl}$, in order to avoid pruning a newly
divided cell too soon.
As mentioned, a sparse intermediate or unit cell can be pruned when a data
element in the range of the cell is generated. However, a considerable number of such
sparse cells may not be pruned since the possibility of encountering a data element in
the range of a sparse cell is very low. All sparse intermediate or unit cells can be
forced to be pruned together by examining their current supports. This mechanism is
called a force-pruning operation. Since the distribution statistics of all intermediate
or unit cells must be examined, a force-pruning operation takes a relatively long
time. For this reason, it can be performed periodically or when the
current number of cells reaches a predefined threshold value.
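A force-pruning pass can be sketched as a single scan over all non-initial cells. The data layout below (a dictionary of cells, each with a count and a parent identifier) is a hypothetical simplification that tracks counts only; it is not the structure used in the paper.

```python
def force_prune(cells, total, s_prn):
    """Remove every non-initial cell whose support is below s_prn.

    cells maps a cell id to {"count": int, "parent": id or None};
    initial cells have parent None and are never removed.
    """
    pruned = []
    for cell_id in list(cells):          # copy the keys, since cells are deleted in the loop
        cell = cells[cell_id]
        if cell["parent"] is None:
            continue
        if cell["count"] / total < s_prn:
            cells[cell["parent"]]["count"] += cell["count"]  # return the count to the parent
            del cells[cell_id]
            pruned.append(cell_id)
    return pruned

cells = {
    "g0": {"count": 0, "parent": None},
    "g0.1": {"count": 900, "parent": "g0"},
    "g0.2": {"count": 5, "parent": "g0"},
}
removed = force_prune(cells, total=10_000, s_prn=0.001)
print(removed, cells["g0"]["count"])
```

Because every cell is visited, the cost grows with the number of cells, which matches the remark that the operation should not be run on every data element.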



divide $N_1, \ldots, N_d$ into p intervals and create $p^d$ initial cells;
/* S(g) : the support of cell g, $S(g) = c^t / |D^t|$ */
for a data stream $D^t = \{ e^1, e^2, \ldots, e^t \}$ do   /* t is increasing */
    read current data element $e^t$;
    search the cell g which includes $e^t$;
    update $\mu^t$, $\sigma^t$, $c^t$ of the cell g;
    if g is a unit cell or an intermediate cell {
        if $S(g) \ge S_{spl}$ {
            if $|f_i(g) - s_i(g)| > \lambda$ in any dimension i {
                /* dividing an intermediate cell g */
                find the largest $\sigma_i^t$ among $\sigma^t$ where $|f_i(g) - s_i(g)| > \lambda$;
                generate $g_1$ and $g_2$;
                set the statistics of $g_1$ and $g_2$; eliminate cell g;
            }
        }
        else if $S(g) \le S_{prn}$ {
            /* pruning a cell g */
            find the parent initial cell $g_p$ where g is included in $g_p$;
            update $g_p$ with the statistics of g;
            eliminate cell g;
        }
    }
    else if g is an initial cell {
        if $S(g) \ge S_{spl}$ {
            /* dividing an initial cell g */
            find the largest $\sigma_i^t$ among $\sigma^t$ where $|f_i(g) - s_i(g)| > \lambda$;
            generate $g_1$ and $g_2$;
            set the statistics of $g_1$ and $g_2$;
            set $c^t=0$ and $\mu_i^t=\sigma_i^t=0$ for $\forall i$ dimension;
        }
    }

Fig. 2. The statistical $\sigma$-partition clustering

Figure 2 shows the detailed steps of the proposed algorithm. When a cell is split,
the counts of the two divided cells are initialized by assuming that the actual distribution of
data elements in the dividing dimension of the cell is a normal distribution. However,
if the actual distribution of data elements in the original cell is not normal,
there is a certain estimation error. In other words, the count of each divided
cell may be incorrectly initialized. This estimation error of a certain range in a data
space is accumulated until the range is represented by a unit cell through a series of
cell partitioning operations. Once the range becomes a unit cell, there is no additional
estimation error since the number of data elements in its range is actually counted. Therefore,
the accumulated error count of a unit cell is constant. Moreover, the support error caused by
the accumulated error count in a dense unit cell decreases continuously due to
Property 1. For a new unit cell $g(RS, c^t, \mu^t, \sigma^t)$ created in the current data stream $D^t$, let
$|D_g^t|$ denote the actual count of data elements in its range R(g) up to $D^t$. The estimation
error E(g) in the count $c^t$ of the cell g is constant and is defined by the difference
between $|D_g^t|$ and its estimated count $c^t$, i.e., $E(g) = |\,|D_g^t| - c^t\,|$.



Property 1. (Support error decreasing property) When a unit cell g is newly
created in the current data stream $D^t$, its support error is $E(g)/|D^t|$ and the estimation error
E(g) is constant. After m additional data elements are processed subsequently,
the total number of data elements is increased to $|D^{t+m}|$ in $D^{t+m}$ and $|D^{t+m}| > |D^t|$ is
satisfied. Consequently, the support error of the cell g in $D^{t+m}$ becomes $E(g)/|D^{t+m}|$, and

$$\frac{E(g)}{|D^t|} > \frac{E(g)}{|D^{t+m}|}$$

is satisfied. As m increases without bound, $E(g)/|D^{t+m}|$ converges to 0, i.e.,

$$\lim_{m \to \infty} \frac{E(g)}{|D^{t+m}|} = 0.$$

Therefore, it becomes negligible.
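A small numeric illustration of Property 1 follows. The error count of 50 and the stream sizes are hypothetical values chosen only to show the trend.

```python
def support_error(estimation_error, total_elements):
    """Support error of a unit cell: a constant error count over a growing stream size."""
    return estimation_error / total_elements

# A unit cell created at |D^t| = 10,000 with a constant estimation error of 50 elements.
errors = [support_error(50, 10_000 + m) for m in (0, 10_000, 90_000, 990_000)]
print(errors)
```

Each additional batch of elements enlarges the denominator while the numerator stays fixed, so the sequence of support errors is strictly decreasing.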



A unit cell in the current data stream $D^t$ is dense if its current support is greater
than or equal to a predefined minimum support $S_{min}$. A cluster in the current data
stream is a set of adjacent dense unit cells. The smaller the size $\lambda$ of a unit cell,
the more precisely the range of a cluster is identified. On the other hand, a
possibly dense cell is split earlier when the value of the split support $S_{spl}$ is lower. For this
reason, a dense unit cell is found earlier, which enables a unit cell to monitor its
actual count more accurately. Furthermore, as the gap between $S_{spl}$ and $S_{prn}$ is increased,
a larger number of intermediate cells are maintained while the supports of identified
clusters are found more precisely.
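The definition of a cluster as a set of adjacent dense unit cells can be sketched as a connected-components search. The sketch assumes that unit cells are indexed by integer grid coordinates and that two cells are adjacent when they differ by one step in exactly one dimension; both are assumptions made for illustration, as are the names and counts.

```python
from collections import deque

def find_clusters(cell_counts, total, s_min):
    """Group dense unit cells (support >= s_min) into sets of adjacent cells."""
    dense = {idx for idx, count in cell_counts.items() if count / total >= s_min}
    clusters, seen = [], set()
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            cell = queue.popleft()
            component.add(cell)
            # Visit the two neighbours along every dimension.
            for dim in range(len(cell)):
                for step in (-1, 1):
                    neighbour = cell[:dim] + (cell[dim] + step,) + cell[dim + 1:]
                    if neighbour in dense and neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
        clusters.append(component)
    return clusters

counts = {(0, 0): 30, (0, 1): 25, (1, 1): 22, (5, 5): 40, (3, 3): 1}
clusters = find_clusters(counts, total=200, s_min=0.1)
print(clusters)
```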



4 Experimental Results 

In order to analyze the performance of the proposed method, a data set containing one
million 4-dimensional data elements is generated by the data generator used in
ENCLUS [14]. Most data elements are concentrated in 10 randomly chosen
distinct data regions whose size in each dimension varies randomly from 10 to
20. The result of the proposed method is compared with that of the
grid-based clustering algorithm STING. The values of a pruning support $S_{prn}$ and a split
support $S_{spl}$ are assigned relative to a predefined minimum support $S_{min}$. The entire
data space of the data set is divided into 4 initial cells. In all experiments, data
elements are looked up one by one in sequence to simulate the environment of a data
stream.




Fig. 4. Accuracy variations to $S_{min}$ and $\lambda$

Fig. 5. Accuracy variations to $S_{prn}$

Fig. 6. Accuracy variations to $S_{spl}$



Figure 4 shows the accuracy of the proposed method for varying values of $\lambda$
and $S_{min}$. The accuracy of the proposed method is measured relative to that of
STING. In other words, it is the ratio of the number of elements correctly clustered
by the proposed method to the total number of data elements clustered by STING.

Figure 5 shows the accuracy of the proposed method for varying values of $S_{prn}$.
The sequence of generated data elements is divided into 5 intervals, each of which
consists of 200 thousand elements. The average accuracy in each interval is shown.
As shown in this figure, the accuracy of the first interval is lower than
those of the other intervals. This is because the support of an intermediate cell
varies too sensitively in the first interval. As a result, many cell partitioning operations
are performed in the first interval to produce a set of meaningful unit cells. However,
the accuracy stabilizes as the total number of data elements increases. As the value
of $S_{prn}$ is increased, the accuracy becomes lower since a considerable number of
possibly dense intermediate cells are pruned before they become unit cells. Figure 6
shows the effect of $S_{spl}$ on the accuracy in the first interval. As the value of $S_{spl}$ is set
lower, unit cells are generated more quickly, so that the accuracy is improved in
the early stage of clustering. However, regardless of the value of $S_{spl}$, the stabilized
accuracy is the same.




Fig. 7. Memory usage variations to $S_{spl}$

Fig. 8. Memory usage variations to $S_{prn}$

Fig. 9. Memory usage with force-pruning

Figure 7 shows the maximum number of cells in the first interval when no cell is
pruned. The maximum number of cells is stabilized after the first interval. As the
value of $S_{spl}$ is set lower, the maximum number of cells increases since cell
partitioning operations are performed frequently to generate meaningful unit cells. As
the number of dense unit cells increases, the maximum number of cells is
stabilized. In Figure 8, the variation of the maximum number of cells is shown when
sparse cells are pruned. After most dense unit cells are generated, the maximum
number of cells can be decreased by setting the value of $S_{prn}$ adequately. When $S_{prn}$ is
set to 30% of $S_{spl}$, the memory usage is not decreased. The reason is that most
divided intermediate cells are pruned too quickly and their initial cells are repeatedly
partitioned again. On the contrary, when the value of $S_{prn}$ is set to 10%, the memory
usage is minimized since dense intermediate cells are successfully divided into their
dense unit cells while sparse ones are pruned properly.

A force-pruning operation is usually performed periodically or when it is needed.
Figure 9 shows the memory usage of the proposed method for varying periods of a
force-pruning operation. In this experiment, two force-pruning periods f = 1,000 and
f = 10,000 are compared. A force-pruning period f = 1,000 means that a force-pruning
operation is performed whenever 1,000 new data elements are processed. The
memory usage of each interval is represented by the maximum number of cells. The
number of cells decreases as the period is shortened. As this experiment shows, a
force-pruning operation does not have to be performed frequently. Instead, it only
needs to be performed when many intermediate cells are partitioned.



5 Conclusion 

In this paper, a grid-based statistical $\sigma$-partition clustering method for a data stream is
proposed. The multi-dimensional data space is dynamically divided into a set of cells
with different sizes. By maintaining only the distribution statistics of data elements in
each cell, its current support is precisely monitored. A dense sub-range of a data
space is partitioned repeatedly until it becomes a set of dense unit cells. Two
thresholds $S_{spl}$ and $S_{prn}$ are proposed to control the performance of the proposed method in a
data stream. A split support $S_{spl}$ is used to determine how fast dense unit cells are
identified. A pruning support $S_{prn}$ is used to remove meaningless sparse intermediate
or unit cells. Therefore, it can be used to minimize the usage of main memory.
However, if it is too high, a less accurate clustering result can be obtained. By controlling
these two thresholds properly, the performance of the proposed algorithm can be
flexibly controlled.

Abstract. The interest of introducing fuzzy predicates when learning
rules is twofold. When dealing with numerical data, it enables us to avoid 
arbitrary discretization. Moreover, it enlarges the expressive power of 
what is learned by considering different types of fuzzy rules, which may 
describe gradual behaviors of related attributes or uncertainty pervading 
conclusions. This paper describes different types of first-order fuzzy rules 
and a method for learning each type. Finally, we discuss the interest of 
each type of rules on a benchmark example. 

Keywords: Inductive Logic Programming, relational learning, fuzzy 
rule, confidence degree 



1 Introduction 

Inductive Logic Programming (ILP) [9] provides a general framework for learn- 
ing classical first-order logic rules, for which reasonably efficient algorithms have 
been developed (Progol [6], FOIL [13],...). Relational learning can be presented 
as a subfield of ILP that concerns the induction process on relational databases 
compiled in first-order logic. In this scope, we have only to consider function-free 
Horn clauses. But first-order logic cannot directly handle rules with exceptions, 
which are common in practice. This has been a motivation for introducing
probabilities in ILP [7]. In fact, probabilities implicitly appear in the FOIL control
procedure. Indeed, during the gain computation, the value associated with a rule
can be viewed as a confidence degree expressed in terms of “domain
probabilities”. Such probabilities, together with “world probabilities”, are the basic
notions of Halpern’s first-order probabilistic logic [4]. Domain probabilities are
used to capture statistical information for a fixed first-order logic interpreta- 
tion. These probabilities are obtained by applying a probability measure to the 
set of valuations making rules true in the interpretation. So, there is no longer 
any genuine quantifier in a rule when the probability to encounter exceptions is 
non-zero. 

One of the difficulties of the induction of rules from examples is to manage
real numbers and imprecision when attributes are non-binary. The classical method
for handling real-valued attributes is to turn them into (symbolic) qualitative
labels by discretization. Fuzzy sets are known to provide a gradual interface
with numerical data, by escaping the problem of sharp transitions between
categories. In the propositional framework, confidence degrees have been integrated
in learning methods, together with the handling of fuzzy properties. At least
three main trends of work can be distinguished w.r.t. this latter concern. First,
neuro-fuzzy learning techniques have been developed for tuning fuzzy
membership functions in fuzzy rules; see [8] for a survey. The fuzzy rules which are
produced in that way are used for function approximation in automatic control
problems. Another research line has been investigated with a greater concern
for the descriptive power of the fuzzy rules from the user’s point of view, by
extending Quinlan’s [12] ID3 algorithm to fuzzy decision trees, involving fuzzy
descriptions of classes and making use of entropy measures (extended to fuzzy
sets) for building the fuzzy rules; see [1] for a survey. More recently, the use of
fuzzy membership functions has been advocated by several researchers for
providing association rules in data mining with a better representation power, e.g.
[5].

Presently, the majority of the methods for learning fuzzy rules are
propositional. A version of FOIL that handles membership degrees has already been
developed [15] but the rules induced still keep a classical meaning. In this
paper, we propose a method for inducing first-order rules that may include fuzzy
predicates. We first explain how a classical database is read in terms of fuzzy
predicates, and we then discuss different types of fuzzy rules recently
introduced in a learning perspective [11]. For each type of rule, the FOIL algorithm
is adapted by defining the corresponding confidence degree. The paper is
organized as follows. Section 2 provides a brief background on ILP. Section 3
presents different types of fuzzy rules and the fuzzy database. Section 4
describes our algorithm and Section 5 illustrates the approach on a toy example
and a benchmark.

2 Background 

We first briefly recall the standard definitions and notations. Given a first-order
language $\mathcal{L}$ with a set of variables Var, we build the set of terms Term, atoms
Atom and formulas as usual. The set of ground terms is the Herbrand universe
$\mathcal{H}$ and the set of ground atoms or facts is the Herbrand base $\mathcal{B} \subset Atom$. A
literal l is just an atom a (positive literal) or its negation $\neg a$ (negative literal). A
(resp. ground) substitution $\sigma$ is an application from Var to Term (resp. $\mathcal{H}$) with
inductive extension to Atom. We denote by Subst the set of ground substitutions. A
clause is a finite disjunction of literals $l_1 \vee \ldots \vee l_n$, also denoted $\{l_1, \ldots, l_n\}$. A Horn
clause is a clause with at most one positive literal. A Herbrand interpretation
I is just a subset of $\mathcal{B}$: I is the set of true ground atomic formulas and its
complement denotes the set of false ground atomic formulas. Let us denote by
$\mathcal{I} = 2^{\mathcal{B}}$ the power set of $\mathcal{B}$, i.e. the set of all Herbrand interpretations. We can
now proceed with the notion of logical consequence.

Definition 1. Given A an atomic formula, $I, \sigma \models A$ means that $\sigma(A) \in I$. As
usual, the extension to general formulas F uses compositionality.
$I \models F$ means $\forall \sigma$, $I, \sigma \models F$ (we say I is a model of F).

$\models F$ means $\forall I \in \mathcal{I}$, $I \models F$.

$F \models G$ means that all models of F are models of G.

Stated in the general context of first-order logic, the task of induction is to find
a set of formulas H such that:

$$B \cup H \models E \qquad (1)$$

given a background theory B and a set of observations E (training set), where
E, B and H here denote sets of clauses. A set of formulas is here, as usual,
considered as the conjunction of its elements.

Of course, one may add two natural restrictions:

- $B \not\models E$ since, otherwise, H would not be necessary to explain E.

- $B \cup H \not\models \bot$: this means $B \cup H$ is a consistent theory.

In ILP, there are two ways of describing examples. The first describes the set of
positive examples $E^+$ and the set of negative examples $E^-$. The other describes
only positive examples in E and makes the closed world assumption. It is this
hypothesis that we will use throughout this paper. Each element of E is called an
example and we call a counter-example a fact on the target concept which is
not in E. In the setting of relational databases, inductive logic programming is
often restricted to Horn clauses and function-free formulas, and E is just a set of
ground facts. Moreover, the set E itself satisfies the previous requirement but it
is generally not considered as an acceptable solution since it has no predictive
ability. Usually, rule extraction fits with the idea of providing a compression of
the information content of E.

There are two general types of algorithms, top down and bottom up
algorithms. Top down ones start from the most general clause and specialize it step
by step. Bottom up procedures start from a fact and generalize it. In our case, we
will use the FOIL algorithm [13] which is a top down process. The goal of FOIL
is to produce rules until all the examples are covered. Rules with conclusion part
C, the target predicate, are found in the following way:

1. take $A \rightarrow C$ as the most general clause with $A = \top$

2. choose the literal l such that the clause $l \wedge A \rightarrow C$ maximizes the gain function

3. $A = l \wedge A$

4. if confidence($A \rightarrow C$) < threshold goto 2

5. return $A \rightarrow C$
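The five steps can be sketched in Python for a deliberately simplified, propositional setting in which each example is a row of boolean attributes. This is an illustration of the greedy loop and of the gain function introduced in the next paragraph, not an implementation of FOIL itself; the attribute names, the data and the two extra stopping conditions are hypothetical.

```python
import math

def confidence(body, examples):
    """Fraction of the rows satisfying every body literal that also satisfy the target."""
    covered = [row for row in examples if all(row[lit] for lit in body)]
    if not covered:
        return 0.0
    return sum(row["target"] for row in covered) / len(covered)

def gain(literal, body, examples):
    """n * (log2 cf(l and A -> C) - log2 cf(A -> C)), n = positives covered by the new rule."""
    new_body = body + [literal]
    cf_new, cf_old = confidence(new_body, examples), confidence(body, examples)
    if cf_new == 0.0 or cf_old == 0.0:
        return float("-inf")
    n = sum(1 for row in examples if row["target"] and all(row[l] for l in new_body))
    return n * (math.log2(cf_new) - math.log2(cf_old))

def learn_rule(literals, examples, threshold):
    body = []  # the empty body plays the role of A = T (step 1)
    while confidence(body, examples) < threshold:          # step 4
        candidates = [l for l in literals if l not in body]
        if not candidates:
            break  # added stopping condition: nothing left to try
        best = max(candidates, key=lambda l: gain(l, body, examples))  # step 2
        if gain(best, body, examples) == float("-inf"):
            break  # added stopping condition: no literal keeps any positive example
        body.append(best)                                   # step 3
    return body                                             # step 5

examples = [
    {"heavy": True, "old": True, "target": True},
    {"heavy": True, "old": False, "target": True},
    {"heavy": True, "old": True, "target": True},
    {"heavy": False, "old": True, "target": False},
    {"heavy": False, "old": False, "target": False},
    {"heavy": False, "old": True, "target": True},
]
rule_body = learn_rule(["heavy", "old"], examples, threshold=0.9)
print(rule_body)
```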

The gain function is computed by the formula:

$$gain(l \wedge A \rightarrow C, A \rightarrow C) = n * (\log_2(cf(l \wedge A \rightarrow C)) - \log_2(cf(A \rightarrow C)))$$

where n is the number of distinct examples covered by $l \wedge A \rightarrow C$. Given a Horn
clause $A \rightarrow C$, the confidence is $cf(A \rightarrow C) = P(A \wedge C)/P(A)$. Confidence degrees are
computed according to the definition of domain probabilities [4]. ILP data are
supposed to describe one interpretation under the Closed World Assumption. We
call this interpretation $I_{ILP}$. So, given a fact f:

$$I_{ILP} \models f \text{ iff } B \wedge E \models f.$$

The domain $\mathcal{H}$ is the Herbrand domain described by B and E. We take P as a
uniform probability on $\mathcal{H}$. So we deduce that the confidence in a clause $A \rightarrow C$,
with $\vec{t}$ as vector on the n free variables, is:

$$cf(A(\vec{t}) \rightarrow C(\vec{t}))_{I_{ILP}} = \frac{|\{\vec{t} \in \mathcal{H}^n \mid I_{ILP} \models A(\vec{t}) \wedge C(\vec{t})\}|}{|\{\vec{t} \in \mathcal{H}^n \mid I_{ILP} \models A(\vec{t})\}|} \qquad (2)$$

where | | denotes cardinality. Another possible definition of a confidence degree 
might be taken here as the proportion of the number of positive examples covered 
by the rule w.r.t. the number of total examples (positive and negative) covered 
by the rule. This confidence degree would represent the probability that a fact 
deduced from the rule is true. But this definition would not take into account 
the number of situations covered in the condition part of the rule, which is not 
always the total number of examples covered since we are in a first-order setting. 

In ILP, the goal is to learn a concept represented by a predicate. E is the 
set of all facts pertaining to the target predicate. B is the set of facts pertaining 
to predicates other than the target one. So the learned rules are (in the non- 
recursive case) composed by predicates that appear in B for the condition part 
and by the target predicate in the consequence part. 



3 Induction in Fuzzy Database 

3.1 Fuzzy Databases and Fuzzy Rules 

We consider a first-order logic database K with fuzzy predicates (e.g., heavy,
old, ...) as a set of positive facts labeled by real numbers in [0,1]. For
instance, in Section 5 we shall deal with a database containing facts such as
(weight(a, heavy), 0.9), which means that the car a is very representative of heavy
cars. Thus, K is made of pairs of the form $(A(\vec{t}), \mu(A(\vec{t})))$ for $\vec{t} \in \mathcal{H}^n$, where
$A(\vec{t})$ is a fact, and $\mu(A(\vec{t}))$ is the satisfaction degree associated with the fuzzy
property A for $\vec{t}$.

There exist at least two reasons for introducing fuzzy predicates in
universally quantified rules. This may be for making them either more flexible or more
expressive. Indeed, a fuzzy predicate P can be viewed as a family of ordinary
predicates $P_\alpha$ whose characteristic functions are the level cut functions associated
to the fuzzy set membership function $\mu_P$, namely $\mu_{P_\alpha}(\vec{t}) = 1$ iff $\mu_P(\vec{t}) \ge \alpha$
and $\mu_{P_\alpha}(\vec{t}) = 0$ otherwise. Thus a rule “$A(\vec{t}) \rightarrow C(\vec{t})$” is naturally
associated with the crisp rules “$A_\alpha(\vec{t}) \rightarrow C_\beta(\vec{t})$”. Note that, if $A_\beta(\vec{t})$ holds then
$A_\alpha(\vec{t})$ also holds for $\alpha < \beta$. So we may only consider the crisp approximations
“$A_\alpha(\vec{t}) \rightarrow C_\alpha(\vec{t})$”.
Then, if we are concerned with flexibility, a possible understanding of the
fuzzy rule “$A(\vec{t}) \rightarrow C(\vec{t})$” can be

$$\forall \vec{t}, \exists \alpha \quad A_\alpha(\vec{t}) \rightarrow C_\alpha(\vec{t}) \qquad (3)$$

i.e. there exists a crisp understanding of the fuzzy rule which covers each example
(but it is not necessarily the same for each example since $\alpha$ depends on $\vec{t}$). This
is a kind of rule already considered in [10]. By flexible rules, we mean here rules
which are robust since their predicates can be adapted to borderline situations.

If we are concerned with expressivity, we may look for fuzzy rules such that
the rule holds for each of its level cut counterparts. This means that we have

$$\forall \vec{t}, \forall \alpha \quad A_\alpha(\vec{t}) \rightarrow C_\alpha(\vec{t}) \qquad (4)$$

This is clearly more restrictive than (3) since the fuzzy rule is equivalent to a
set of ordinary rules with nested predicates and summarizes it into a unique
fuzzy rule. In fact (4) is nothing but a gradual rule [3] expressing “The more $\vec{t}$
satisfies A, the more $\vec{t}$ satisfies C” (since they are modeled by a constraint of
the form $\mu(A(\vec{t})) \le \mu(C(\vec{t}))$).
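The level cuts and the two readings (3) and (4) can be illustrated for a single valuation with a short Python sketch. The membership degrees and the grid of ten $\alpha$ levels are hypothetical; the function names are chosen only for this illustration.

```python
def alpha_cut(membership, alpha):
    """Crisp set of the objects whose membership degree is at least alpha."""
    return {obj for obj, degree in membership.items() if degree >= alpha}

def rule_holds(mu_a, mu_c, alpha):
    """Crisp rule A_alpha -> C_alpha for one valuation (true when A_alpha is false)."""
    return mu_a < alpha or mu_c >= alpha

LEVELS = [i / 10 for i in range(1, 11)]  # alpha = 0.1, 0.2, ..., 1.0

def is_flexible(mu_a, mu_c):
    """Reading (3): some level cut of the rule holds for this valuation."""
    return any(rule_holds(mu_a, mu_c, alpha) for alpha in LEVELS)

def is_gradual(mu_a, mu_c):
    """Reading (4): every level cut of the rule holds for this valuation."""
    return all(rule_holds(mu_a, mu_c, alpha) for alpha in LEVELS)

heavy = {"a": 0.9, "b": 0.6, "c": 0.2}  # hypothetical satisfaction degrees
print(alpha_cut(heavy, 0.5), alpha_cut(heavy, 0.8))
print(is_flexible(0.8, 0.5), is_gradual(0.8, 0.5), is_gradual(0.4, 0.7))
```

On this grid, `is_gradual` agrees with the constraint that the degree of the condition does not exceed the degree of the conclusion, while `is_flexible` only requires one level at which the crisp rule is not violated.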

Gradual rules are one of the four basic kinds of fuzzy rules [3]. Two of them,
namely gradual rules and certainty rules, are based on implication connectives
and express constraints on the possible models of the world. The two other types,
named possibility rules and antigradual rules, rather express that some values
are guaranteed to be possible (i.e. that they exist in the base of examples). For
instance, let us take possibility rules of the form “The more it is A, the more
all the interpretations which make C true (truth becomes a matter of degree
when C is fuzzy) are guaranteed to be possible”. This means in practice that
“The more $\vec{t}$ is A, the more there are examples for any possible interpretation
of C”. Note that this rule cannot have any “classical” counter-example since we
are interested in the distribution of the membership degrees in the database.

In the following, we only consider gradual and certainty rules. Certainty rules
contrast with possibility rules, and express that “the more $\vec{t}$ is A, the more
certain it is C”. Let us first consider the case where “A” is a fuzzy predicate
and “C” is an ordinary predicate. This expresses that “the more it is A, i.e. the
greater $\alpha$ such that $A(\vec{t}) \ge \alpha$, the smaller the number of exceptions of the rule
$A_\alpha(\vec{t}) \rightarrow C(\vec{t})$”. Indeed when $\alpha$ decreases, the number of exceptions cannot
but increase since the scope of $A_\alpha$ is then enlarged. When C is also a fuzzy
predicate, in order to preserve this understanding of the rule, we are led to look
for rules of the form

$$\forall \vec{t}, \forall \alpha \quad A_\alpha(\vec{t}) \rightarrow C_{1-\alpha}(\vec{t}), \qquad (5)$$

since when $\alpha$ increases $C_{1-\alpha}$ covers more cases.

3.2 Application to ILP 

It is well known that algorithms for learning rules have difficulties in handling
real-valued attributes. In fact, numerical values may lead to an infinite hypothesis
space. In relational learning, this problem is deeper since the hypothesis space is
already large. The difficulties grow when real numbers appear in the concept
we want to learn. Real numbers are essentially treated in two ways: either by
introducing constraints or by discretization.

Introducing constraints in first-order logic consists in the use of several
operators such as inequalities or mathematical functions (average, ...) [14]. Rules
induced by this method may suffer from a lack of expressivity and generality.
Furthermore, algorithms dealing with constraints go beyond the scope of the standard
resolution process in first-order logic.

The second way is to use discretization and clustering for transforming
continuous information into qualitative information. Then, information can be
directly treated in the classical logic setting. This method is currently the most
widely used since it makes it possible to cope with numerical values and improves readability.
Since the clusters are usually defined in an arbitrary way before the induction
process, the rules which are produced depend on the quality of the clusters.
These clusters are often represented by predicates having an imprecise meaning.
For example in the auto-mpg data in UCI, the mpg (city fuel consumption
in miles per gallon) value can be represented in terms of the predicates “low
consumption”, “medium consumption” and “high consumption”. In this case,
fuzzy labels, represented by fuzzy sets, are more appropriate for describing the
mpg values since they avoid arbitrary thresholds between low and medium (see
Fig. 1 for description).
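Fuzzy labels of this kind are commonly modeled by trapezoidal membership functions. The sketch below is only an illustration: the breakpoints are hypothetical and are not those of Fig. 1, and the labels are attached to ranges of the mpg value itself rather than to the consumption predicates.

```python
def trapezoid(x, a, b, c, d):
    """Membership rising on [a, b], equal to 1 on [b, c], falling on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

# Hypothetical breakpoints on the mpg axis; neighbouring labels overlap on 15-20 and 25-30.
LABELS = {
    "low": (-1.0, 0.0, 15.0, 20.0),
    "medium": (15.0, 20.0, 25.0, 30.0),
    "high": (25.0, 30.0, 100.0, 101.0),
}

def fuzzify(mpg):
    """Degree to which an mpg value belongs to each label."""
    return {label: trapezoid(mpg, *points) for label, points in LABELS.items()}

degrees = fuzzify(17.0)
print(degrees)
```

A value of 17 belongs partly to two labels instead of being forced to one side of a sharp threshold, which is the point made in the paragraph above.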




[Figure: fuzzy labels plotted over the Mpg axis]

Fig. 1. Fuzzy cluster of mpg



Finally, using fuzzy predicates makes it possible to relax the rigidity of crisp clustering
while keeping the readability of the induced rules. Moreover, the different types of
fuzzy rules we have described allow a better description and provide new types of
summarization of the data. Flexible rules can be viewed as a fuzzy adaptation
of crisp rules in the sense that there exists a reading of the fuzzy predicates,
corresponding to a high level cut of their fuzzy representation, which leads to a
meaningful rule w.r.t. data. Gradual rules and certainty rules are new types
of rules, which describe new implicative properties of the data. In the case of
ILP, the goal is to find a hypothesis which is sound and complete with respect
to the examples. The hypothesis is sound if it does not cover facts on the target
concept which are false in the interpretation defined by the background and the
examples. It is complete if it covers all facts on the target concept which are
true in the latter interpretation. In the case of fuzzy ILP, the definition of an
example covered by a rule will depend on the type of the fuzzy rule and on the
membership degrees of facts validating the rules.

4 Algorithm 

In the FOIL algorithm, the guidelines for the process are: the confidence degree,
the halting condition and the number of distinct examples covered by the rule.
We consider that an example is covered by a fuzzy rule if it is itself covered
by the classical counterpart of the rule. So we describe these guidelines for each
kind of rule (see [10] [11] for details).

Flexible Rules. This first type of meaning for a fuzzy rule “$A(\vec{t}) \rightarrow C(\vec{t})$”
is close to the one of a classical rule. Of course, we are now expecting that
the satisfaction degrees of $A(\vec{t})$ and $C(\vec{t})$ are as high as possible. So we can
introduce classical interpretations associated with each $\alpha$-cut.

Definition 2. An a-interpretation la, given a fact f is defined by: 

Ia\= f iff B AE \= f and /r(/) > a 

In this type of interpretation, only facts whose satisfaction degree reaches
$\alpha$ are true. Now we have to compute the confidence degree of the rule in
the classical way (using (2)) for each $\alpha$-interpretation. According to the
intended meaning of the fuzzy rule, we must favor the confidence degrees of the
rule computed in high $\alpha$-interpretations. Indeed, we prefer the examples
to be covered with a high degree of satisfaction. The following definition,
which is an adaptation in terms of first-order logic of the one proposed by [2],
takes this into account:

$$cf_{flex}(A(\vec{x}) \rightarrow C(\vec{y})) = \sum_{\alpha_i} (\alpha_i - \alpha_{i+1}) * cf(A(\vec{x}) \rightarrow C(\vec{y}))_{I_{\alpha_i}}$$

where $\alpha_1 = 1, \ldots, \alpha_t = 0$ is the decreasing list of the
satisfaction degrees that appear in the database. This confidence degree
corresponds to the discretization of a Choquet integral of the confidence
degrees on $\alpha$-interpretations. We deduce the number of distinct examples
covered:

$$n_{flex}(A(\vec{x}) \rightarrow C(\vec{y})) = \sum_{\alpha_i} (\alpha_i - \alpha_{i+1}) * |\{\vec{t}_1 \mid \exists \vec{t}_2,\ I_{\alpha_i} \models \sigma[\vec{t}_1, \vec{t}_2 / \vec{x}_1, \vec{x}_2](A \wedge C)\}|$$
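The weighted sum over $\alpha$-cuts can be illustrated with a short Python
sketch. It is only a sketch under assumptions that are not in the text: each
ground instantiation of the rule is summarised by a hypothetical pair (degree of
the condition part, degree of the conclusion part), a fact counts as true at
level $\alpha$ when its degree is at least $\alpha$, and the classical
confidence is taken to be the fraction of instantiations satisfying the
condition that also satisfy the conclusion.

```python
def crisp_confidence(pairs, alpha):
    """Classical confidence in the alpha-interpretation (assumed definition):
    instantiations with condition and conclusion true / condition true."""
    body = [(a, c) for a, c in pairs if a >= alpha]
    if not body:
        return 0.0
    covered = [(a, c) for a, c in body if c >= alpha]
    return len(covered) / len(body)


def flexible_confidence(pairs):
    """Discretized Choquet integral: sum of (alpha_i - alpha_{i+1}) * cf at alpha_i."""
    # Decreasing list of the degrees found in the data, forced to run from 1 to 0.
    levels = sorted({1.0, 0.0} | {d for pair in pairs for d in pair}, reverse=True)
    return sum((hi - lo) * crisp_confidence(pairs, hi)
               for hi, lo in zip(levels, levels[1:]))


# Hypothetical (condition degree, conclusion degree) pairs.
pairs = [(1.0, 1.0), (0.5, 1.0), (1.0, 0.5)]
print(flexible_confidence(pairs))  # 0.5 * 0.5 + 0.5 * 1.0 = 0.75
```

With purely crisp degrees (0 or 1) only the level $\alpha = 1$ carries weight,
so the sketch falls back to the classical confidence.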

Gradual Rules. In this case, the values of the satisfaction degrees are only
useful for comparing satisfaction degrees in condition and conclusion parts. So,
we do not privilege the confidence degree in high $\alpha$-interpretations as
previously.

$$cf_{grad}(A(\vec{x}) \rightarrow C(\vec{y})) = \frac{|\{\vec{t} \mid I_{ILP} \models \sigma[\vec{t}/\vec{x}](A \wedge C),\ \mu(\sigma[\vec{t}/\vec{x}]C) \geq \mu(\sigma[\vec{t}/\vec{x}]A)\}|}{|\{\vec{t} \mid I_{ILP} \models \sigma[\vec{t}/\vec{x}](A)\}|}$$

When the valuation of the condition part of the rule is a conjunction of ground
literals, the satisfaction degree of this conjunction is the minimum of the
degrees of the literals. We deduce the number of distinct examples covered:

$$n_{grad}(A(\vec{x}) \rightarrow C(\vec{y})) = |\{\vec{t}_1 \mid \exists \vec{t}_2,\ I_{ILP} \models \sigma[\vec{t}_1, \vec{t}_2 / \vec{x}_1, \vec{x}_2](A \wedge C),\ \mu(\sigma[\vec{t}_1/\vec{x}_1]C) \geq \mu(\sigma[\vec{t}_1, \vec{t}_2 / \vec{x}_1, \vec{x}_2]A)\}|$$

Type 1 Certainty Rules. The meaning of the fuzzy rule
“$A(\vec{x}) \rightarrow C(\vec{y})$” is then “the more $\vec{x}$ is $A$, the
more certain $\vec{y}$ is $C$”. For these rules we are not interested in the
satisfaction degrees of the consequence parts. This type of rule will be
referred to as type 1 certainty rules in the following. The $\alpha$-cuts for
these rules correspond to the following type of classical interpretation:

Definition 3. An $\alpha$-certainty interpretation, given a fact $f$, is defined by:

$$I_{\alpha\text{-}cert} \models f \quad \text{iff} \quad (B \models f \ \text{and}\ \mu(f) \geq \alpha) \ \text{or}\ E \models f$$

With this kind of rule, confidence degrees are expected to be high for high
$\alpha$-certainty interpretations. The idea is that we can be more permissive
with respect to exceptions for the classical counterparts of the rule
“$A(\vec{x}) \rightarrow C(\vec{y})$” corresponding to small values of $\alpha$.
So, we are led to use the following Choquet integral.

$$cf_{cert1}(A(\vec{x}) \rightarrow C(\vec{y})) = \sum_{\alpha_i} (\alpha_i - \alpha_{i+1}) * cf(A(\vec{x}) \rightarrow C(\vec{y}))_{I_{\alpha_i\text{-}cert}}$$

We deduce the number of distinct examples covered:

$$n_{cert1}(A(\vec{x}) \rightarrow C(\vec{y})) = \sum_{\alpha_i} (\alpha_i - \alpha_{i+1}) * |\{\vec{t}_1 \mid \exists \vec{t}_2,\ I_{\alpha_i\text{-}cert} \models \sigma[\vec{t}_1, \vec{t}_2 / \vec{x}_1, \vec{x}_2](A \wedge C)\}|$$

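Type 2 Certainty Rules. The above definition is modified in the following
way to take into account the satisfaction degree of the consequence of the
rules. This type of rule will be referred to as type 2 certainty rules in the
following.

$$cf_{cert2}(A(\vec{x}) \rightarrow C(\vec{y})) = \frac{|\{\vec{t} \mid I_{ILP} \models \sigma[\vec{t}/\vec{x}](A \wedge C),\ \mu(\sigma[\vec{t}/\vec{x}]C) \geq 1 - \mu(\sigma[\vec{t}/\vec{x}]A)\}|}{|\{\vec{t} \mid I_{ILP} \models \sigma[\vec{t}/\vec{x}](A)\}|}$$

We deduce the number of distinct examples covered:

$$n_{cert2}(A(\vec{x}) \rightarrow C(\vec{y})) = |\{\vec{t}_1 \mid \exists \vec{t}_2,\ I_{ILP} \models \sigma[\vec{t}_1, \vec{t}_2 / \vec{x}_1, \vec{x}_2](A \wedge C),\ \mu(\sigma[\vec{t}_1/\vec{x}_1]C) \geq 1 - \mu(\sigma[\vec{t}_1, \vec{t}_2 / \vec{x}_1, \vec{x}_2]A)\}|$$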

Thus, we can use the FOIL algorithm for inducing various kinds of first-order 
fuzzy rules by adapting confidence degree and cardinality with the type of rules 
we want to learn. 
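For the two counting-based confidence degrees (gradual rules and type 2
certainty rules), a minimal Python sketch may help to see how they differ. It
assumes, beyond what the text states, that each ground instantiation is given
as a hypothetical pair (degree of the condition part, degree of the conclusion
part) and that a part is true in the underlying interpretation when its degree
is strictly positive.

```python
def gradual_confidence(pairs):
    """Fraction of instantiations satisfying the condition whose conclusion
    degree is at least the condition degree."""
    body = [(a, c) for a, c in pairs if a > 0]
    if not body:
        return 0.0
    good = [(a, c) for a, c in body if c > 0 and c >= a]
    return len(good) / len(body)


def certainty2_confidence(pairs):
    """Same ratio, with the comparison mu(C) >= 1 - mu(A) of type 2 certainty rules."""
    body = [(a, c) for a, c in pairs if a > 0]
    if not body:
        return 0.0
    good = [(a, c) for a, c in body if c > 0 and c >= 1 - a]
    return len(good) / len(body)


# Hypothetical (condition degree, conclusion degree) pairs.
pairs = [(0.8, 0.9), (0.6, 0.3), (0.2, 0.5), (0.0, 1.0)]
print(gradual_confidence(pairs))     # 2 of the 3 instantiations with a true condition
print(certainty2_confidence(pairs))  # 1 of the 3 instantiations with a true condition
```

The same data can thus score quite differently depending on the intended
meaning of the rule, which is why the confidence degree has to be chosen
together with the rule type.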



5 Results 

5.1 Illustrative Example 

Let us consider a database that describes 21 houses in a town. First we have
some fuzzy relational predicates such as (close(x, y), $\alpha$), which means
that the house x is close to the house y with a membership degree $\alpha$, or
(know(x, y), $\alpha'$), which means that the owner of house x knows the owner
of house y to a degree $\alpha'$ (from 0 for unknown to 1 for friends). The
houses are also described with some nearly propositional fuzzy predicates such
as price(x, expensive) or size(x, small).

So, in this context, we can find fuzzy rules of each type with a good confidence
degree. For example, we find the flexible rule

close(x, y), price(y, expensive) → price(x, expensive),

with a confidence degree of 0.81, because we can reasonably expect that a house
which is close to an expensive one is expensive as well (since expensive houses
are often located in the same area). A typical gradual rule is

size(x, large) → price(x, expensive),

i.e. “the larger the house, the more expensive”, which describes the fact that
price grows with size. Its confidence degree is 0.80. A good example of a
certainty rule is

close(x, y) → know(x, y)

with a confidence degree of 0.95 if the rule is viewed as a type 1 certainty
rule and 0.88 if it is viewed as a type 2 certainty rule. This rule means “the
closer the houses, the more certain we are that the owners know each other”.
The fact that owners of very close houses have a high probability of knowing
each other is realistic. This probability can decrease when the distance
between the houses grows. As a closing remark, we may observe that all these
rules are obviously subject to exceptions and, despite their interest, they
cannot be obtained in any way by a classical ILP machine.



5.2 Benchmark 

As a benchmark, we use the “auto-mpg” database from UCI ^ . This database
contains information about cars, and the concept we want to learn is the
city-cycle fuel consumption in miles per gallon. There are 398 instances of cars
described by 9 attributes of which 5 are continuous, including the concept to be
learned. This database can be represented in propositional logic but is
sufficient to illustrate the interest of the approach. First, the database has
been “discretized” with fuzzy sets. Moreover, we also built three crisp
discretizations corresponding to i) the crisp partition of the attribute domain
which is the closest to the fuzzy partition, ii) the support of the fuzzy sets,
iii) the core of the fuzzy sets. Then, we learn the class of city-cycle fuel
consumption according to the crisp and fuzzy sets for all the types of fuzzy
rules.
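To make the three crisp discretizations concrete, here is a small Python sketch
for a single fuzzy set. The trapezoidal shape, the breakpoints and the use of
the 0.5 level as the "closest" crisp partition are all assumptions made for
illustration; the text does not say how the fuzzy sets were built.

```python
def trapezoid(x, a, b, c, d):
    """Membership degree of x in a trapezoidal fuzzy set (assumed shape)
    with support (a, d) and core [b, c], where a <= b <= c <= d."""
    if b <= x <= c:
        return 1.0
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def crisp_variants(x, a, b, c, d):
    """Three crisp readings of the same fuzzy set."""
    degree = trapezoid(x, a, b, c, d)
    return {
        "core": degree == 1.0,     # only fully satisfying values
        "support": degree > 0.0,   # every value with a non-zero degree
        "closest": degree >= 0.5,  # assumption: 0.5-cut as closest crisp set
    }


# Hypothetical fuzzy set "medium" on some numeric attribute.
print(trapezoid(15, 10, 20, 30, 40))       # 0.5
print(crisp_variants(12, 10, 20, 30, 40))
```

A borderline value such as 12 belongs to the support but neither to the core
nor to the 0.5-cut, which is the kind of example the three discretizations
treat differently.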



^ http://www.ics.uci.edu/~mlearn/MLRepository.html

Types of rules                      nbr of rules   coverage   avg cf
classical rules                     8              0.77       0.84
classical rules with the core       11             0.59       0.87
classical rules with the support    10             0.85       0.80
flexible rules                      3              0.51       0.84
gradual rules                       4              0.47       0.91
type 1 certainty rules              2              0.62       0.75
type 2 certainty rules              2              0.59       0.76


Here are examples of rules induced by the algorithm for each type (the
classical rules are those induced on the discretisation corresponding to the
crisp partition of the attribute domain which is the closest to the fuzzy
partition).

Classical rules

cylinders(A, 8) → mpg(A, low)

Flexible rules

displacement(A, low), weight(A, medium) → mpg(A, medium)

Gradual rules

weight(A, high) → mpg(A, low)

Type 1 certainty rules

cylinders(A, 6), weight(A, high) → mpg(A, low)

Type 2 certainty rules

cylinders(A, 6), weight(A, high), origin(A, 1), acceleration(A, low),
horsepower(A, low) → mpg(A, low)

As expected, the coverage score of classical rules is between the score of
classical rules with the core of fuzzy sets and classical rules with the support
of fuzzy sets. This is due to the fact that, with the core of the fuzzy sets,
the examples that are in the boundary of the crisp classes are not treated. On
the contrary, with the support of fuzzy sets, the examples that are in the
boundary of the crisp classes can belong to two classes. The smaller score of
fuzzy rules w.r.t. coverage is due to the fact that fuzzy rules are harder to
find than classical ones. This result is expected because fuzzy rules are more
constrained since they take into account the membership degree of the valuations
of each predicate. In fact, confidence degrees of classical rules do not rely on
the distance of the data to the boundaries of the discretized sets. For example,
let us consider a classical rule F with a good confidence degree, and its fuzzy
flexible counterpart F'. If many examples of F are borderline w.r.t. fuzzy sets,
the confidence degree of F' will be lower than the one of F. On the contrary, if
many counter-examples of F are borderline, the confidence degree of F' will be
greater than the one of F. So, the confidence degree of the fuzzy flexible
counterpart of a rule is a good indicator of the robustness of the classical
rule w.r.t. small variations of the boundaries of the sets.

Type 1 certainty rules focus on membership degrees of the conditional parts
of the rules. Gradual and certainty rules show how condition and conclusion
parts evolve together. These rules have a meaning far from the classical one and
the rules that we find do not necessarily have a crisp counterpart or
approximation. The fuzzy rules that handle certainty tend to favor the non-fuzzy
predicates in the condition part because they leave more freedom with respect to
the satisfaction degree of covered examples. Note that some rules could be
described in propositional logic, but here the instantiations are automatically
generated by the algorithm. As shown in some rules, the algorithm can mix fuzzy
predicates and non-fuzzy predicates.

6 Conclusion 

In this paper, we have provided a formal framework and a procedure for dealing
with fuzzy predicates and learning fuzzy first-order rules of different kinds in
the case of relational databases. Since the confidence degree computation is a
weighted version of FOIL’s, it is easy to deduce that the complexity of our
algorithm is the same as FOIL’s. The definition of confidence degrees for each
kind of rule allows us to take into account the fuzzy predicates in the
algorithms that use confidence degrees for guiding the learning process. Using
fuzzy predicates for managing real-valued data, instead of using crisp
discretization or constraint-based induction, offers a good compromise between
the readability of the rules and the flexibility of the discretization.
Moreover, fuzzy predicates allow new kinds of relations to be extracted.

Through the example, we see that fuzzy rules are often too constrained for
covering all the examples of the target concept, but they convey information on
the robustness of the rules w.r.t. borderline examples. So, it can be useful to
learn fuzzy rules together with classical ones.

In this paper, we focus on the search for different kinds of fuzzy rules and the
definition of confidence degrees associated with each of them. In further work,
it will be interesting to show how efficient fuzzy discretization is from a
learning point of view. More generally, a formal definition of ILP that handles
all the types of rules must be defined. In this context, automatic deduction
mechanisms may be developed for testing the efficiency of fuzzy rules in terms
of classification.
Abstract. This paper presents an extension of prior work by Michael
D. Lee on psychologically plausible text categorisation. Our approach
utilises Lee’s model as a pre-processing filter to generate a dense
representation for a given text document (a document profile) and passes that
on to an arbitrary standard propositional learning algorithm. Similarly
to standard feature selection for text classification, the dimensionality
of instances is drastically reduced this way, which in turn greatly lowers
the computational load for the subsequent learning algorithm. The filter
itself is very fast as well, as it basically is just an interesting variant of
Naive Bayes. We present different variations of the filter and conduct an
evaluation against the Reuters-21578 collection that shows performance
comparable to previously published results on that collection, but at a
lower computational cost.



1 Introduction 

In the last decade the amount of textual information in digital form has grown
exponentially, mainly due to the ever-increasing accessibility of the Internet.
It is crucial to create tools to organise the amount of information available.
Text categorisation is one such tool. It aims at classifying textual documents
into predefined categories. Text categorisation applications are manifold,
ranging from automated meta-data extraction for indexing to document
organisation for databases or web pages (see Yang et al. [1]). Other interesting
uses of text categorisation include text filtering, generally as part of a
producer-consumer relationship, or word sense disambiguation when dealing with
natural language processing (see Roth [2]).



1.1 Existing Text Categorisation Methods 

It is difficult to be exhaustive when listing the existing text categorisation
methods. Amongst the main approaches, decision tree methods (“divide-and-conquer”
approach) have the advantage of being “human readable” in the sense that they
deal with symbolic entities and not numeric values [3]. Investigations using
probabilistic models usually focus on Naive Bayes and its variants [4]. Joachims
[5] introduced the support vector machine method to text categorisation. Also
worth mentioning is the Rocchio method [6]; this method creates a prototype
document for each class from the training set. A test document will be assigned
to the class of the closest prototype found. Yang [7] invented a mapping approach
using a multivariate regression model and investigated, together with Pedersen,
lazy learning for text categorisation [8]. Frank et al. [9] investigated text
categorisation using compression models, and Wiener et al. [10] used neural
networks. Yet other approaches have tried to improve predictive performance by
incorporating semantic information like WordNet hypernyms [11].

1.2 Lee’s Method

Lee [12] came up with a psychologically plausible approach considering three
different insights. Firstly, Lee noted that people are able not only to state
that a given document is about a given topic but also that a document is not
about a topic. Take for example “middle east conflict” as a topic; the
occurrence of the word “rugby” in a document would give a strong hint about the
document not being about the topic. Secondly, humans are able to make
non-compensatory decisions: one can decide if a document is about a topic or not
without necessarily having to read the whole document. Using our previous
middle-east conflict example, if the document starts with something like “The
South African rugby team just arrived in Auckland...” most people would not need
to read any further to reach a conclusive, in this case negative, decision.
Thirdly, people have the capacity to give answers with a level of confidence and
so they are able to state if a document is either definitely about a topic or
alternatively just remotely related to a topic.

The formal definition of Lee’s model is based on a Bayesian analysis, which
states that it is possible to compute the posterior odds of a document being
about a topic or not by multiplying the prior odds (the chances of a document
being about a topic before looking at it) and the evidence (the probability
that a document would have been generated under the assumption that it is about
a topic, or not). Lee considers the document as a sequence of words. The
evidence then becomes the product of the probability of each word being in a
document about the topic (or not). Note that this approach follows the Naive
Bayes assumption that all words are independent of one another, also called the
independence assumption. The probabilities making up the evidence are
quantified using the number of occurrences of the given word over the total
number of words and are calculated for both categories, for and against the
topic. The independence assumption allows analysing words sequentially, which
permits monitoring the evolution of the posterior odds word by word in the same
order as they appear in the document. Considering the logarithm of the posterior
odds and using evidence for and against the topic leads to the following
equation:

$$\ln \frac{\Pr(c_j \mid d_i)}{\Pr(\neg c_j \mid d_i)} = \ln \frac{\Pr(c_j)}{\Pr(\neg c_j)} + \sum_{k=1}^{n_i} \ln \frac{\Pr(w_{ik} \mid c_j)}{\Pr(w_{ik} \mid \neg c_j)}$$

where $w_{ik}$ denotes the $k$-th word of document $d_i$ and $n_i$ its number of
words.

Figure 1 shows the evolution of the posterior odds of two documents processed
by the text classifier. The graph on the left depicts the partial log-odds sums
for a document about the topic, while the graph on the right depicts those
sums for a document that is not about the topic. Note that the partial log-odds
sums are computed over larger and larger initial subsequences of the document,
which causes the order of the words in the document to become significant.
This example also shows the possibility of non-compensatory decision making
by setting two thresholds, one for the document being about the topic and the
other for the document not being about the topic. The decision is taken when
one of the thresholds is reached. Let’s assume the thresholds in Figure 1 are
100 for a document about the category and −20 for a document not about the
category. In the left hand side case, the decision is taken after reading the
120th word (when the curve meets y = 100). In the right hand side example,
the decision can be taken after reading the 45th word (when the curve meets
y = −20).
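The sequential reading of a document with two decision thresholds can be
sketched in Python as follows. This is an illustrative sketch, not Lee’s
implementation: the function names, the toy per-word log-odds values and the
choice of giving unknown words a contribution of zero are assumptions.

```python
def document_profile(words, influence, prior_log_odds=0.0):
    """Partial log-odds sums after each word, in reading order.
    Words missing from the dictionary contribute 0 (an assumption)."""
    profile, total = [], prior_log_odds
    for word in words:
        total += influence.get(word, 0.0)
        profile.append(total)
    return profile


def sequential_decision(profile, upper, lower):
    """Stop at the first word where one of the two thresholds is reached."""
    for position, value in enumerate(profile, start=1):
        if value >= upper:
            return "about", position
        if value <= lower:
            return "not about", position
    return "undecided", len(profile)


# Hypothetical per-word log-odds for one topic.
influence = {"oil": 2.0, "rugby": -3.0}
profile = document_profile(["the", "oil", "oil", "rugby"], influence)
print(profile)                                # [0.0, 2.0, 4.0, 1.0]
print(sequential_decision(profile, 4.0, -2.0))  # ('about', 3)
```

Because the decision is taken as soon as a threshold is met, the final word
"rugby" in the toy document is never read, which is the non-compensatory
behaviour described above.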





Fig. 1. Illustration of document profiles, the left hand side one is about the topic while 
the right hand side one is not. 



1.3 Our Approach 

The investigation presented here is an extension of Lee’s work [12]. An interest- 
ing aspect of Lee’s method is that the document is processed sequentially and 
the odds of the document with respect to a given category can be tracked as 
the words are fed to the system. We call this sequence of the partial sums of the 
log-odds of the words of a document a document profile. Usually those profiles 
are not as clear-cut and easy to classify as the ones shown in Figure 1. We have 
therefore decided to investigate a two-step process, where a first step generates 
document profiles according to Lee’s method, and a second step extracts propo- 
sitional information from these profiles that then can be fed into any arbitrary 
propositional learner. Thus, Lee’s system is used as a dimensionality-reducing 
pre-processing step. 

The next section will explain this process in more detail, discuss issues with 
the dictionary, and basically describe two different ways of extracting attributes 
from document profiles. Sections 3 and 4 explain the experimental setup and 
give and discuss experimental results. In Section 5 we present conclusions and 
discuss further work. 
2 Generating and Manipulating Document Profiles



To construct a model, each word in the vocabulary is assigned two probabilities,
the probability of the word being about the topic, $\Pr(w_k \mid c_j)$, and the
probability of the word not being about the topic, $\Pr(w_k \mid \neg c_j)$,
where $w_k$ is the word and $c_j$ the topic. The word’s influence $I_{w_k}$ is
then calculated as follows:

$$I_{w_k} = \ln \frac{\Pr(w_k \mid c_j)}{\Pr(w_k \mid \neg c_j)}$$


The probabilities are based on the rate at which the word has occurred in the
training documents about and not about the category. Figure 2 (on the left hand
side) portrays one such dictionary. The higher the magnitude of the influence,
the more weight the word will have in the final decision. Note that there are
many more words with a negative influence. The explanation for that lies in
the skewness of the training data: the dictionary pictured in Figure 2 (on
the left hand side) was trained with 197 documents about a given category and
9,406 documents not about that category. The 9,406 negative examples used for
training were in fact the union of all 89 other categories. The skewness also
explains why the maximum positive amplitude is greater than the maximum
negative one. Specialised words of the category in focus (words with a large
positive influence score) are more likely to have a denser concentration in the
positive documents than the specialised words of the other category (actually
categories).
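A dictionary of this kind could be built along the following lines. The sketch
is in Python and rests on assumptions: documents are lists of lower-cased
tokens, the influence is the natural logarithm of the ratio of the two word
rates, and add-one smoothing is used so that words unseen on one side do not
produce a division by zero. The smoothing in particular is not described in the
text.

```python
import math
from collections import Counter


def influence_dictionary(positive_docs, negative_docs, smoothing=1.0):
    """Map each word to ln(Pr(w | topic) / Pr(w | not topic)).
    Rates are occurrences over total words; smoothing is an added assumption."""
    pos = Counter(word for doc in positive_docs for word in doc)
    neg = Counter(word for doc in negative_docs for word in doc)
    vocabulary = set(pos) | set(neg)
    pos_total = sum(pos.values()) + smoothing * len(vocabulary)
    neg_total = sum(neg.values()) + smoothing * len(vocabulary)
    return {
        word: math.log(((pos[word] + smoothing) / pos_total)
                       / ((neg[word] + smoothing) / neg_total))
        for word in vocabulary
    }


# Toy training data: one document about the topic, one not about it.
influence = influence_dictionary([["oil", "price"]], [["rugby", "price"]])
for word in sorted(influence):
    print(word, round(influence[word], 3))
```

In the toy data "oil" receives a positive influence, "rugby" a negative one,
and "price", which is equally frequent on both sides, an influence of zero.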




Fig. 2. Dictionary for one category of the Reuters dataset and a shifted version of the 
dictionary on the right. 



The three following sub-sections will describe dictionary manipulations that 
proved to be beneficial, and explain the two ways propositional attributes are 
extracted from document profiles. 

2.1 Shifting the Origin on the y-Axis 

Figure 2 (right hand side) illustrates the result of shifting the origin on the
y-axis in an attempt to equalise the maximum amplitudes. To shift the
dictionary, we subtracted half the sum of the top positive value and the top
negative value from all the words present in the dictionary, or more formally:

$$\forall k \in d : I'_{w_k} = I_{w_k} - \frac{\max(I_w) + \min(I_w)}{2}$$

where $I_{w_k}$ and $I'_{w_k}$ are respectively the influence of the word before
and after shifting, $\max(I_w)$ the largest influence in the dictionary and
$\min(I_w)$ the smallest. Note that a lot more words now have a negative
influence, and that the magnitude of the positive influences has been reduced.
This shift usually improves performance. A more sophisticated threshold
selection method might fare even better.
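As a small worked example of the shift, the following Python sketch subtracts
half the sum of the largest and smallest influence from every entry. The
function name and the toy values are hypothetical.

```python
def shift_dictionary(influence):
    """Subtract (max + min) / 2 from every influence so that the largest
    positive and largest negative amplitudes become equal."""
    offset = (max(influence.values()) + min(influence.values())) / 2
    return {word: value - offset for word, value in influence.items()}


# Toy dictionary with a larger positive than negative amplitude.
shifted = shift_dictionary({"oil": 4.0, "price": 0.0, "rugby": -2.0})
print(shifted)  # {'oil': 3.0, 'price': -1.0, 'rugby': -3.0}
```

The formerly neutral word "price" now carries a negative influence, which
mirrors the remark that more words become negative after shifting.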



Table 1. Dictionary sizes after cutting off at x% of the top influence value.

percent of the maximum positive/negative   # of words remaining
value cut off                               in the dictionary
0% (whole dictionary)                       31651
10%                                         4046
15%                                         2860
30%                                         1288
60%                                         166



2.2 Reducing the Size of the Dictionary 

As mentioned earlier, the specialised words carry a large influence score, but
their distribution is highly skewed: there are almost no specialised words for
the negative class. On the other hand, words with a low influence score are more
evenly distributed between the positive and the negative class. Their low
influence score causes them to play only a minor part in the final decision, but
they can potentially add noise. We have therefore introduced a mechanism to
prune words from the dictionary based on their influence score. The decision
threshold is based on the maximum positive value and the maximum negative value
(of the unshifted dictionary). The cut-off value is determined as a percentage
of the maximum values. A cut at 30% means that all the words with an influence
score between 0 and 30% of the maximum positive influence or between 0 and 30%
of the maximum negative influence score will not be taken into account. Table 1
shows the non-linear relation between the cut-off value and the number of words
left in the dictionary when applying this idea to the dictionary of Figure 2.
This pruning effect is also illustrated in Figure 3 where the pruned
dictionaries for the four different cut-off values of 10%, 15%, 30% and 60%
appear in clockwise order starting from the upper left corner. An additional
advantage of pruning dictionaries is the potential speedup of the generation of
document profiles.
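The pruning rule can be sketched in Python as below. The function name is
hypothetical, and the treatment of words lying exactly on a threshold (they are
kept here) is an assumption, since the text does not specify it.

```python
def prune_dictionary(influence, cut):
    """Keep only words whose influence lies beyond `cut` (a fraction in [0, 1])
    of the largest positive or largest negative influence."""
    max_positive = max(max(influence.values()), 0.0)
    max_negative = min(min(influence.values()), 0.0)
    return {
        word: value
        for word, value in influence.items()
        if value >= cut * max_positive or value <= cut * max_negative
    }


# Toy unshifted dictionary; with cut=0.3 the thresholds are 3.0 and -1.5.
influence = {"a": 10.0, "b": 2.0, "c": -1.0, "d": -5.0, "e": 4.0}
print(sorted(prune_dictionary(influence, 0.3)))  # ['a', 'd', 'e']
```

A cut of 0 keeps the whole dictionary, which matches the first row of Table 1.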
Fig. 3. Different cut off values applied to the dictionary of Figure 2; clockwise from
the upper left corner the values are: 10%, 15%, 30% and 60%.




Fig. 4. Reading off attributes from a document profile. 



2.3 Turning Document Profiles into Attributes 

We have used two different methods to turn document profiles into a constant
number of propositional attributes. We need to deal with the fact that the
number of words in each document is different, and therefore the length of
document profiles differs as well. The first method solves that problem by
simply reading off the value of the document profile after a certain percentage
of the document has been read. Looking at Figure 4, we see that ten values are
extracted, with equal-sized gaps in between. In a naive approach the maximum
number of attributes that can be extracted in this manner is limited by the size
of the smallest document. The second method for extracting attributes is even
simpler, computing just some very high-level summary information about a
document profile. Specifically, such a description comprises a mere seven
attributes: the maximum and the minimum value encountered, the respective
positions of these two extrema relative to the document length, a boolean
indicator whether the maximum is reached before the minimum, and the total
number of words for and against the category (i.e. how many words carried a
positive influence, how many carried a negative influence). Obviously this is
just one of a few possible high-level summary descriptions; other potentially
interesting attributes include document length or final value in the profile.

Table 2. The ten largest categories from the ModAPTE split.

category                       # of training articles   # of test articles
earnings (earn)                2877                     1087
corporate acquisitions (acq)   1650                     719
money market (money-fx)        538                      179
grain (grain)                  433                      149
crude oil (crude)              389                      189
trade issues (trade)           369                      117
interest (interest)            347                      131
wheat (wheat)                  212                      71
shipping (ship)                197                      89
corn (corn)                    181                      56
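Both ways of turning a profile into a fixed number of attributes can be
sketched in Python. The function names, the rounding used to pick sample
positions and the convention of expressing positions as (index + 1) / length
are assumptions chosen for illustration; the sketch also assumes a non-empty
document.

```python
def sample_profile(profile, n_attributes):
    """Read off the profile after k/n of the document, for k = 1..n."""
    length = len(profile)
    samples = []
    for k in range(1, n_attributes + 1):
        # Index of the last word read once k/n of the document is consumed.
        index = max((k * length + n_attributes - 1) // n_attributes - 1, 0)
        samples.append(profile[index])
    return samples


def summary_attributes(word_influences, prior_log_odds=0.0):
    """The seven high-level attributes, from a list of per-word influences."""
    profile, total = [], prior_log_odds
    for value in word_influences:
        total += value
        profile.append(total)
    length = len(profile)
    i_max = max(range(length), key=profile.__getitem__)
    i_min = min(range(length), key=profile.__getitem__)
    return {
        "max": profile[i_max],
        "min": profile[i_min],
        "max_position": (i_max + 1) / length,
        "min_position": (i_min + 1) / length,
        "max_before_min": i_max < i_min,
        "positive_words": sum(1 for v in word_influences if v > 0),
        "negative_words": sum(1 for v in word_influences if v < 0),
    }


print(sample_profile(list(range(1, 11)), 5))       # [2, 4, 6, 8, 10]
print(summary_attributes([0.5, -1.0, 2.0, -0.2]))
```

Either output is a fixed-length vector, so it can be passed to any
propositional learner regardless of the original document length.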

3 Experimental Setup 

To investigate the performance of the two-step process described above we have
conducted an empirical evaluation using the Reuters corpus^. We used the same
train-test split as proposed in [13] where a total of 12,902 documents is split
into a train-set of 9,603 documents and a test-set of 3,299 documents. We
restricted our evaluation to the 10 most common categories, as presented in
Table 2. The only pre-processing operation we performed was to convert all
characters to lower case. We did not use any stemming or stop-word removal
techniques.

For our evaluation we used the standard information-retrieval performance
measures of precision and recall, as well as aggregate measures based on those
two. The aggregate measures were F-measure and the precision-recall mean.
The standard F-measure computes the harmonic mean of precision and recall
and the precision-recall mean is the arithmetic mean (average) of those two
measures. The four formulae are summarised in Table 3. Macroaveraging simply
computes averages of either the F-measures or precision-recall means over several
categories.
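For reference, the four measures and macroaveraging can be written in a few
lines of Python, with tp, fp and fn standing for the numbers of true positives,
false positives and false negatives. Returning 0 when a denominator is zero is
a convention chosen here, not something the text prescribes.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall, i.e. 2tp / (2tp + fp + fn)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def precision_recall_mean(tp, fp, fn):
    """Arithmetic mean of precision and recall."""
    return (precision(tp, fp) + recall(tp, fn)) / 2


def macroaverage(scores):
    """Plain average of per-category scores."""
    return sum(scores) / len(scores)


# Hypothetical counts for one category.
tp, fp, fn = 8, 2, 4
print(precision(tp, fp), recall(tp, fn))
print(f_measure(tp, fp, fn), precision_recall_mean(tp, fp, fn))
```

The arithmetic mean is never smaller than the harmonic mean, so the
precision-recall mean is always at least as large as the F-measure for the same
counts.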

4 Experimental Results 

We have conducted an extensive series of experiments to judge the performance
of various standard classifiers using document profiles, and also to investigate
the effects of the dictionary tuning we have described above. For lack of space
we will only concentrate on some of the findings here; a complete report can be
found in the forthcoming Master’s thesis of the first author. We used the
following classifiers from the Weka [14] package: J48 (C4.5, a decision tree
algorithm [15]), OneR (rule based algorithm [16]), IBk (k-Nearest Neighbour
(k-NN) [17]), SMO (Support Vector Machine [18]) and Naive Bayes (Naive Bayes
algorithm [19]). We have also added a very simple classifier called Polarity
that simply predicts the sign of the last value in the document profile.
Polarity is closely related to multinomial Naive Bayes ([20]).

^ The Reuters-21578 collection may be freely downloaded for experimentation
purposes from www.research.att.com/~lewis/reuters21578.html

Table 3. Four information-retrieval evaluation measures: precision, recall,
F-measure and precision-recall mean.

measure name            formula
precision               $\frac{tp}{tp+fp}$
recall                  $\frac{tp}{tp+fn}$
F-measure               $\frac{2tp}{2tp+fp+fn}$
precision-recall mean   $\frac{tp(2tp+fn+fp)}{2(tp+fp)(tp+fn)}$

4.1 Which Classifier to Use? 

Figure 5 shows the performance of the six different classifiers on the category
trade per number of attributes taken from the profile. J48, OneR, IBk and SMO
show equivalent results, although SMO shows an interesting behaviour, with
recall rising at the expense of precision as the number of attributes increases
past 100. Naive Bayes and Polarity do not seem to be as influenced by the number
of attributes generated from the profile as the aforementioned classifier
schemes. They show impressive recall, but unfortunately also poor precision.
Qualitatively speaking, graphs for other categories look similar.

Figure 6 depicts macroaveraged F-measures of the 6 classifiers on the 10
categories (described in Table 2) per number of samples. Two distinct clusters
are noticeable: above an F-measure of 0.65 with J48, IBk and OneR, and below
with Naive Bayes, SMO and Polarity. Overall, J48 and IBk are clearly dominant,
followed closely by OneR. The poor performance of SMO, Naive Bayes as well
as Polarity is probably caused by the high correlation between the generated
attributes. Summing up, J48 should be preferred to IBk, both for its slightly
better results and because classification with the lazy learning approach is
computationally expensive.



4.2 Pruning the Dictionary Plus Reading only Parts of a Document 

In this section we only employ J48, because it performed well in the experiments
reported in the last section, and it is fast. Figure 7 illustrates the effect of
using a pruned dictionary (on the left hand side) and of only reading initial
portions of documents (on the right hand side) using 150 attributes extracted
from the profile. While the precision is generally not affected by reducing the
number of words in the dictionary, recall significantly decreases as the pruning
percentage increases. Also note that performance is much less affected by the
percentage of the document read than by the size of the pruned dictionary.
Furthermore, precision appears to react more robustly than recall to the effect
of reducing the size of the portion of the document that is being actually read.

Fig. 5. Performances of six different classifiers on the Reuters category Trade.

Fig. 6. Macroaverages for 6 classifiers plotted per number of attributes.

Fig. 7. Effects of reading only initial parts of documents and pruning dictionaries.

Fig. 8. Comparing 7 classifiers’ precision-recall mean macroaverage over the 10
largest Reuters categories.



4.3 Performance of the Summary Attributes 

In this subsection we compare the performance achievable with the seven- 
attribute summary information to various standard text classifiers. Figure 8 
shows the macroaverage of the precision-recall means of 3 variations of this clas- 
sifier against results obtained by [3] for Naive Bayes and [9] for PPM and against 
SMO and J48 on the ten largest Reuters-21578 categories. The three variations 
all used J48 on a shifted dictionary, using either 35% or 40% as a cutoff value, 
and read either 90% or the whole document. Both SMO and J48 were run on top 
of a standard bag-of-words-based feature selection using info-gain. 50 features 
for J48 and 150 features for SMO yielded the best results. The results show that 
J48 using this tiny set of features outperforms Naive Bayes and PPM, closely 
approaches standard J48, but does not perform as well as SMO. 
4.4 Complexity of the Attribute Generation

The algorithm’s complexity is equivalent to the complexity of Naive Bayes in
the sense that it is linear in the number of words and in the number of categories.
The complexity of SMO, for comparison, is on average $n \log(n)$. Accessing the
dictionary to retrieve influence values is generally $\log(n)$, but if necessary,
perfect hash functions could be used to reduce this dictionary access cost to a
constant. Computing such hash functions will be easier for smaller dictionaries.

5 Conclusion 

This paper has presented a text classification approach based on document pro- 
files. Its predictive performance is comparable to more standard approaches, but 
the method is extremely simple, therefore fast and highly scalable. Our two-step 
approach effectively transforms a sparse learning problem into a dense one with- 
out having to explicitly select single features from the original representation. 

The most promising direction for future work will be investigating combinations
of the different sets of attributes available. Two approaches are possible.
First, one can combine the high-level summary, the partial sums, and standard
feature subsets into one larger single set of features. Second, single classifiers
can be trained on these different feature sets in isolation and then be put together
into ensembles. Another direction will be comparing our influence formula
with the usual TFIDF document representation. A good starting point will be
the thorough study carried out by Rennie et al. [21].

Abstract. Topics in 0-1 datasets are sets of variables whose occurrences
are positively connected together. Earlier, we described a simple genera- 
tive topic model. In this paper we show that, given data produced by this 
model, the lift statistics of attributes can be described in matrix form. 
We use this result to obtain a simple algorithm for finding topics in 0-1 
data. We also show that a problem related to the identification of top- 
ics is NP-hard. We give experimental results on the topic identification 
problem, both on generated and real data. 



1 Introduction 

Large collections of 0-1 data occur in many applications, such as information 
retrieval, web browsing, telecommunications, and market basket analysis. While 
the dimensionality of such data sets can be large, the variables (or attributes) 
are seldom completely independent. Rather, it is natural to assume that the 
attributes are organized into (possibly overlapping) topics, i.e., collections of 
variables whose occurrences are somehow connected to each other^. For example, 
in document data the topics correspond to topics of the document: e.g., phrases 
“data mining”, “decision trees” and “association rules” probably are included 
in one topic, which might be called the “data mining” topic. In supermarket 
market basket data, the topics could correspond to classes of products such 
as soft drinks, vegetables, etc. In discretized gene expression data topics could 
correspond to groups of genes that are expressed in similar conditions or tissues. 

Finding topics from data is by no means easy: the topics can be overlapping, 
and a particular topic is active only for a subset of documents. For example, 
simple frequent set based approaches are unable to find topics, as the attributes
in a topic are seldom all 1 together. Much work has searched
for latent structure in 0-1 data (see, e.g., [1,2,3,4,5,6,7,8,9,10]). The approaches
range from simple methods based on covariance-type statistics (e.g., [9]) to full 
probabilistic models (e.g., [4]) and to spectral approaches [10]. 

In order to discover topics from 0-1 data, one first has to specify the model 
for topics, and then give a method that finds topics corresponding to the model. 

^ Our usage of the word topic is similar but not identical to the meaning in information 
retrieval literature, where a topic is a probability distribution on the universe of 
terms, typically concentrating on a few terms. 

In this paper we describe a simple generative topic model, based on our previous
work [11]. We prove some analytical results about the model by using the concept 
of lift [12]. We show that the lift statistics of individual attribute pairs can be 
described in matrix form as linear combinations of lift statistics of disjoint topics. 
Based on this observation, we give a simple algorithm for finding topics in 0-1 
data. We also show that one form of the topic identification problem is NP-hard. 
We give experimental results on both generated and real data, showing that the 
algorithm works well in practice. 

First we review some other methods for finding latent structure in binary 
data. Many of these generative models are quite powerful and are able to de- 
scribe complex situations. On the other hand, finding exact solutions for them 
is computationally intractable, and it is difficult to get a clear picture of the 
quality of the obtained estimates. Many of the methods are also symmetric with 
respect to the data values 0 and 1; on the basis of the asymmetry in the data 
generating process, this can be viewed as a potential source of problems. 

In nonnegative matrix factorization (NMF) [1], an observed data matrix is 
decomposed into a product of two unknown matrices. All three matrices have 
nonnegative entries. The observed data is regarded as a sum of latent variables. 
Lee and Seung give two algorithms for finding the unknown matrices; there is, 
however, no probabilistic interpretation of the results of NMF. Computationally, 
the methods seem very demanding and there are no clear results on the quality
of the solutions [13]. 

The latent semantic analysis (LSA) method [2] uses singular-value decom- 
position to decompose an observed data matrix into a product of matrices. (In 
contrast to NMF, the matrices can have negative entries, too.) In a seminal 
paper by Papadimitriou et al. [3] some arguments were given to justify the per- 
formance of LSI by presenting a probabilistic corpus model. Their basic model 
is quite general and somewhat similar to ours. 

Hofmann [4] has presented a probabilistic version of LSA, termed PLSA. His 
formal model is fairly close to ours and we will show comparative results on the 
models. For each observation vector, some topics are first selected according to 
some observation-specific topic probabilities; then, the topics generate attributes 
according to some topic-attribute probabilities. The attributes are conditionally 
independent given the topic. Hofmann’s main interest is in good estimation of all 
the parameters using the EM algorithm, while we are interested in the structure 
of the data (that is, the probabilities of attributes belonging to topics) and also 
explaining why the methods would find topics. 

Latent Dirichlet Allocation (LDA) [14,15,16] is a method in which the data
model is closely similar to Hofmann’s PLSA but the estimation of the parame- 
ters is computationally more demanding: a variational approximation to the data 
likelihood is needed prior to EM estimation of the parameters. Independent com- 
ponent analysis (ICA) ([8,17,18]) is a statistical method that expresses observed 
multidimensional sequences as combinations of unknown latent variables that
are statistically as independent as possible. The so-called probe distances [19]
of attributes can be used to find (possibly overlapping) sets of attributes that
behave similarly with respect to other attributes; we studied this in an earlier
paper [11]. Cooley and Clifton [9] compute the frequent sets in the data and 
cluster them using a hypergraph partitioning scheme, thus avoiding the problem 
of not having all attributes of a topic present in one data vector. 

A popular method to analyze 0-1 data is the class of finite mixtures of mul- 
tivariate Bernoulli distributions. However, for the Bernoulli models, the values 0 
and 1 have symmetric status, while for our topic models defined in Section 2 this 
is not the case. Another important difference between a Bernoulli (or any other)
mixture model and our model is that in mixture models it is assumed that an
observed 0-1 vector is only generated by one latent topic, although generation 
probabilities are given for all latent topics. In this paper we assume that a data 
vector is generated by the interaction of several latent topics. Binary generative 
topographic mapping [20,21] also assumes that the data vectors are generated 
by one latent topic at a time. 

The rest of this paper is organized as follows. We describe our model and 
examine some of its analytical properties in Section 2. In Section 3 we study the 
lift statistic and describe the simple algorithm based on it. We give experimental 
results in Section 4, and conclude in Section 5. 

2 Topic Models 

In this section we present our concept of a topic model, give the likelihood 
function of the model, and discuss what kinds of parameter values are realistic. 
This form of the model was introduced earlier by us [11]. 

Let U be an n-element set of attributes (e.g., words). A k-topic model T
arranges the n attributes into k topics. The model has the following parameters:
a k-element vector $s = (s_1, \ldots, s_k)$ corresponding to the k topics, and a
$k \times n$ matrix Q whose elements relate the topics to the attributes; the element
corresponding to topic i and attribute A is denoted by $Q_{i,A}$. All elements of s
and Q must be probabilities, i.e., reals in the range [0, 1]; however, neither s nor
any row or column of Q is required to sum up to 1.

A data vector x (e.g., a document) is sampled from T as follows. First, the
active topics are selected by sampling a k-element binary vector t whose every
component $t_i$ is 1 with probability $s_i$, independently of all other components.
Second, the active topics generate the attributes. For each topic i, an n-element
binary vector $x_i$ is sampled so that the component corresponding to A is 1 with
probability $t_i Q_{i,A}$, independently of all other components. The data vector x is
then the logical or (i.e., maximum) of all the vectors $x_i$: $x = \bigvee_{i=1}^{k} x_i$.
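
The two sampling steps can be sketched in a few lines of Python. This is only an illustration of the generative process described above; the function name `sample_vector`, the list-of-lists layout of Q, and the example parameter values are assumptions made for the sketch.

```python
import random

def sample_vector(s, Q, rng):
    """Draw one 0-1 data vector from a k-topic model.

    s: list of k topic probabilities; Q: k x n list of lists, where Q[i][a]
    is the probability that active topic i switches attribute a on.
    Returns the topic vector t and the data vector x.
    """
    k, n = len(s), len(Q[0])
    # Step 1: each topic is active independently with probability s[i].
    t = [1 if rng.random() < s[i] else 0 for i in range(k)]
    # Step 2: every active topic proposes attributes; x is their logical OR.
    x = [0] * n
    for i in range(k):
        if t[i]:
            for a in range(n):
                if rng.random() < Q[i][a]:
                    x[a] = 1
    return t, x

# Made-up example model with two topics and four attributes.
rng = random.Random(0)
s = [0.2, 0.1]
Q = [[0.9, 0.8, 0.0, 0.3],
     [0.0, 0.0, 0.7, 0.6]]
data = [sample_vector(s, Q, rng)[1] for _ in range(1000)]
```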

It would be possible to add another layer on top of the topics, selecting the 
topic probabilities anew for each data vector from, e.g., a Dirichlet distribution. 
Many of our results could be generalized to such settings, which however fall 
outside the scope of this treatment. This type of approach has been taken in 
[3,4,14,15,16]. 

We next present the likelihood function of a k-topic model T with parameters
s, Q. The data D consists of vectors x, each considered independently of
the others.

$$P(D \mid T) = \prod_{x \in D} P(x \mid T).$$

The probability of a single observation x is

$$P(x \mid T) = \sum_{t} P(t \mid T)\, P(x \mid t, T).$$

The sum is taken over all k-element 0-1 vectors t, corresponding to all $2^k$ possible
combinations of active topics. The probability of a topic combination depends
on the parameters s only,

$$P(t \mid T) = P(t \mid s) = \prod_{i=1}^{k} P(t_i \mid s_i) = \prod_{i=1}^{k} s_i^{t_i} (1 - s_i)^{1 - t_i}.$$

The probability of an observation given the active topics depends on the parameters
Q only,

$$P(x \mid t, T) = P(x \mid t, Q) = \prod_{A \in U} P(x_A \mid t, Q),$$

where $x_A$ denotes the element of x that corresponds to the attribute $A \in U$.
A single attribute has a value of either zero or one, with distribution

$$P(x_A \mid t, Q) = p_A^{x_A} (1 - p_A)^{1 - x_A} = \begin{cases} 1 - p_A, & x_A = 0, \\ p_A, & x_A = 1, \end{cases}$$

where

$$p_A = 1 - \prod_{i=1}^{k} (1 - Q_{i,A})^{t_i}.$$
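
For very small models the likelihood can be evaluated directly by enumerating every topic combination. The sketch below does exactly that; it is meant only to make the structure of the sum concrete, and the function names are illustrative. The enumeration over all 0-1 topic vectors is exponential in the number of topics, so this is not a practical estimator.

```python
from itertools import product

def prob_x_given_t(x, t, Q):
    """P(x | t, Q): product over attributes of p_A^x_A (1 - p_A)^(1 - x_A)."""
    prob = 1.0
    for a, xa in enumerate(x):
        none_fires = 1.0
        for i, ti in enumerate(t):
            if ti:
                none_fires *= 1.0 - Q[i][a]
        p_a = 1.0 - none_fires  # probability that some active topic generates a
        prob *= p_a if xa else 1.0 - p_a
    return prob

def prob_t(t, s):
    """P(t | s): independent Bernoulli draws for the topics."""
    prob = 1.0
    for ti, si in zip(t, s):
        prob *= si if ti else 1.0 - si
    return prob

def prob_x(x, s, Q):
    """P(x | T), summing over all 2^k topic combinations."""
    return sum(prob_t(t, s) * prob_x_given_t(x, t, Q)
               for t in product((0, 1), repeat=len(s)))
```

As a sanity check, the values of `prob_x` over all possible data vectors x should sum to 1 for any valid s and Q.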

The likelihood function, if expanded fully, would have a large number of terms
because of the sum over $2^k$ topic combinations t. This suggests a high computational
complexity, and indeed the task of selecting the best t is difficult. This
is illustrated by the following theorem, whose proof we defer to the Appendix. 

Theorem 1. The following problem is NP-complete: given a topic model T, a
single data vector x and a threshold p, decide whether there is a topic assignment
t such that the probability of the data given the assignment exceeds the
threshold, $P(x \mid t, T) > p$.

However, the models involved in the proof would best be described as con- 
trived, so the result should not dissuade us from researching some reasonable 
subclass of topic models. But what kind of models are reasonable? 

One assumption that we will make is that the topic probabilities $s_i$ are small.
This seems reasonable at least in the context of document data: if some words 
occur in a large fraction of all documents, in information retrieval they would 
be classified as stop words and not considered in searches; it is the less common 
words that distinguish interesting documents. 

Another question is the amount of overlap between topics - if two topics
consist of almost completely the same attributes, it does not seem easy to distinguish
between them. In [11] we considered a class of “ε-separable” models,
an idea similar to that in [3]. A model is ε-separable if every topic has a set of
primary attributes and assigns at most a fraction ε of its attribute-activation
weight to the non-primary attributes. However, the ε-separability property does
not perfectly capture the idea of almost-disjoint topics, as the discussion in [11,
before Lemma 3] notes: for example, several topics can “conspire” against another
topic i by giving high weight to one of i’s primary attributes. Even if every
high weight is less than a fraction ε of the topic’s total weight, it is possible that
the majority of activations of that attribute come from the conspiring topics and
not the primary topic.

This leads us to define a different separability concept: a model has θ-bounded
conspiracy if every attribute A has a primary topic i such that

$$\sum_{j \neq i} Q_{j,A} \le \theta\, Q_{i,A}.$$

We conjecture that a model is discoverable from data if it has low values of $s_i$
and conspiracy bounded by some low θ.
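
To make the definition concrete, the following sketch computes the smallest θ for which a given matrix Q has θ-bounded conspiracy. It assumes the bound reads "the total weight of the non-primary topics for A is at most θ times the primary topic's weight", which matches the way random models are constructed in Section 4.1; the function name is illustrative.

```python
def conspiracy(Q):
    """Smallest theta such that the model has theta-bounded conspiracy.

    For each attribute the most favorable primary topic is the one with the
    largest Q[i][a]; theta is the worst ratio (others' total) / (primary).
    """
    k, n = len(Q), len(Q[0])
    theta = 0.0
    for a in range(n):
        column = [Q[i][a] for i in range(k)]
        primary = max(column)
        if primary == 0.0:
            continue  # attribute is never generated; it imposes no constraint
        theta = max(theta, (sum(column) - primary) / primary)
    return theta
```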



3 Using the Lift Statistic 



We now consider a statistic commonly called lift or interest [12,22,23],

$$\mathrm{lift}(A, B) = \frac{P(A \mid B)}{P(A)} = \frac{P(A, B)}{P(A)\,P(B)},$$

which is a kind of a relative risk factor: how much more common is it to observe A
given that B is observed, compared to no information about B? Lift was chosen
because it measures dependence, which is highly relevant to topic models - when
two attributes belong strongly to the same topic, their co-occurrence should deviate
significantly from the independence assumption. For independent A and B,
lift(A, B) = 1, and the stronger the (positive) dependence, the higher the lift.
Note that our model predicts $\mathrm{lift}(A, B) \ge 1$ for all pairs $A, B \in U$; thus, one
way of assessing whether the model fits a given data set is to see how lift(A, B)
is actually distributed.
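
Estimating lift from data only requires occurrence and co-occurrence counts. The sketch below computes the empirical lift for every attribute pair of a small 0-1 data set; the function name and the convention of reporting undefined lifts as None are choices made for this illustration.

```python
def lift_matrix(data):
    """Empirical lift(A, B) = P(A, B) / (P(A) P(B)) for all attribute pairs.

    data: list of equal-length 0-1 lists. Pairs whose lift is undefined
    (an attribute that never occurs) are reported as None.
    """
    m, n = len(data), len(data[0])
    count = [sum(row[a] for row in data) for a in range(n)]
    lift = [[None] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if count[a] == 0 or count[b] == 0:
                continue
            both = sum(1 for row in data if row[a] and row[b])
            lift[a][b] = (both / m) / ((count[a] / m) * (count[b] / m))
    return lift
```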



Proposition 1. Assume that attribute A is only generated by topic i. Then for
any attribute B,

$$\mathrm{lift}(A, B) = \frac{P(t_i \mid B)}{P(t_i)} = \frac{P(t_i, B)}{P(t_i)\,P(B)}.$$



Proof. We factorize the probabilities: $P(A) = P(A, t_i) = P(t_i)\,P(A \mid t_i)$ and
$P(A, B) = P(t_i, A, B) = P(t_i)\,P(B \mid t_i)\,P(A \mid t_i, B)$. Since A is only generated
by topic i, $P(A \mid t_i, B) = P(A \mid t_i)$. Thus

$$\mathrm{lift}(A, B) = \frac{P(A, B)}{P(A)\,P(B)} = \frac{P(t_i)\,P(A \mid t_i)\,P(B \mid t_i)}{P(t_i)\,P(A \mid t_i)\,P(B)}.$$



428 Jouni K. Seppanen, Ella Bingham, and Heikki Mannila 



Using Bayes’ theorem $P(B \mid t_i) = P(B)\,P(t_i \mid B) / P(t_i)$ and canceling terms we
obtain the result. □

What Proposition 1 says is that if A is a “core attribute” of topic i, i.e., an
attribute generated by i only, then A represents i perfectly in lift calculations,
even if $Q_{i,A} < 1$. Of course in practice, when the lift must be estimated from
data, a small value of $Q_{i,A}$ can cause poor results. Another point to note is that
the probability $P(B \mid t_i)$ appearing in the proof is not the model parameter $Q_{i,B}$.
Instead, it is the probability that any topic will generate B conditioned on the
fact that at least topic i is active. Proposition 1 has as immediate consequences
two results that we used already in [11].

Corollary 1. If attributes A and B are only generated by topic i, i.e.,
$Q_{j,A} = Q_{j,B} = 0$ for $j \neq i$, then $\mathrm{lift}(A, B) = s_i^{-1}$.



Corollary 2. If attribute A is only generated by topic i and attribute B is only 
generated by topic j, then lift(A, B) = 1. 

Thus, the lift statistic between attributes belonging to one topic only is very 
simple. The interesting question is how lift behaves when an attribute belongs 
to several topics. 

Assume that attribute A is only generated by topic i, and attribute B is
generated by both topics i and j. Now lift(A, B) is, after simplification,

$$\frac{P(A, B)}{P(A)\,P(B)} = \frac{Q_{i,B} + s_j Q_{j,B} - Q_{i,B}\, s_j Q_{j,B}}{s_i Q_{i,B} + s_j Q_{j,B} - s_i s_j Q_{i,B} Q_{j,B}} \approx \frac{Q_{i,B} + s_j Q_{j,B}}{s_i Q_{i,B} + s_j Q_{j,B}},$$

where in the approximation we have assumed that $Q_{i,B}\, s_j Q_{j,B}$ and $s_i s_j Q_{i,B} Q_{j,B}$
are small compared to the other terms. The above formula generalizes to the
case where B is generated by some other topics than i and j, too: before the
approximation we then have several second order terms $s_\ell Q_{\ell,B}$ corresponding to
all topics $\ell$ that generate B, and similarly several third order terms $s_\ell Q_{i,B} Q_{\ell,B}$
(in the numerator) or fourth order terms $s_i s_\ell Q_{i,B} Q_{\ell,B}$ (in the denominator).
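
A quick numerical check shows how close the approximation is when the topic probabilities are small. The sketch below evaluates the two-topic case exactly, using lift(A, B) = P(B | t_i) / P(B) (Proposition 1 combined with Bayes' theorem), and compares it with the formula that drops the higher-order terms. The function names and parameter values are made up for the example.

```python
def exact_lift(s_i, s_j, q_i, q_j):
    """lift(A, B) for a core attribute A of topic i and an attribute B
    generated by topics i and j, with q_i = Q[i][B] and q_j = Q[j][B]."""
    # P(B | t_i): topic i is active, topic j is active with probability s_j.
    p_b_given_ti = 1.0 - (1.0 - q_i) * (1.0 - s_j * q_j)
    # P(B): either topic may be active and generate B.
    p_b = 1.0 - (1.0 - s_i * q_i) * (1.0 - s_j * q_j)
    return p_b_given_ti / p_b

def approx_lift(s_i, s_j, q_i, q_j):
    """The same quantity with the third- and fourth-order terms dropped."""
    return (q_i + s_j * q_j) / (s_i * q_i + s_j * q_j)

print(exact_lift(0.05, 0.05, 0.6, 0.4), approx_lift(0.05, 0.05, 0.6, 0.4))
```

Setting `q_j` to 0 reduces `exact_lift` to 1 / s_i, in agreement with Corollary 1.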

Assume now that all the topic probabilities are (approximately) equal, i.e.,
$s_\ell \approx s$ for all topics $\ell$. Then we can write the above formula as
$\mathrm{lift}(A, B) \approx (s^{-1} Q_{i,B} + Q_{j,B}) / (Q_{i,B} + Q_{j,B})$. Furthermore, let each topic $\ell$
have $c_\ell$ core attributes that are only generated by that topic. Then using
Corollaries 1 and 2 we note that the lifts of A and all core attributes can be
included in the formula as follows:



Observation. The lift between a core attribute A of topic i and an attribute B
generated by topics i and j is

$$\mathrm{lift}(A, B) \approx \sum_{A'} \mathrm{lift}(A, A')\, c_i^{-1}\, \frac{Q_{i,B}}{Q_{i,B} + Q_{j,B}} + \sum_{D'} \mathrm{lift}(A, D')\, c_j^{-1}\, \frac{Q_{j,B}}{Q_{i,B} + Q_{j,B}},$$

where $\sum_{A'} \mathrm{lift}(A, A')\, c_i^{-1}$ is an averaged estimate of $s^{-1}$, $\sum_{D'} \mathrm{lift}(A, D')\, c_j^{-1} = 1$,
and the two sums run over the core attributes A' and D' of topics i and j,
respectively. Also, we may add a third summation including lift(A, F') where F'
is a core attribute belonging to a topic $\ell$ to which B does not belong, as then
$Q_{\ell,B} = 0$ and the whole term vanishes. This observation again generalizes to the
case where B is generated by multiple topics.

The above reasoning included approximations in discarding high-order terms
and the somewhat crude assumption that all $s_i$ are equal. In any case, it does
yield an idea of how to discover topics: for an attribute B that belongs to
several topics, define a vector α whose length is the total number of all core
attributes. The element corresponding to A (a core attribute of topic i) is
$\alpha_A = Q_{i,B} / (c_i \sum_j Q_{j,B})$. Then $\mathrm{lift}(A, B) \approx \alpha^T\, \mathrm{lift}(A, \cdot)$ for all core attributes A,
where we denote by lift(A, ·) the vector of lifts between A and all core attributes
(where lift(A, A) = 0). This gives us an algorithm for finding the topics to which
the attributes belong, and also the parameters Q:

— Identify those attributes that belong to one topic only - this can be done
by looking at the lift statistics, which are always either 1 or 1/s for those
attributes.

— Cluster those attributes using some traditional clustering algorithm; at this
stage the clusters do not overlap and do not cover all attributes - if an
attribute B belongs to several topics, its lifts are intermediate between 1
and 1/s, and so B is not clustered. For A belonging to one topic i only,
$Q_{i,A} = P(A, A') / P(A')$, which can be averaged over all A' belonging to the
same topic i as A.

— For attributes B which are not clustered, find a decomposition
$\mathrm{lift}(B, \cdot) = \alpha^T R$, where the square symmetric matrix R has the vectors
lift(A, ·) (of already clustered attributes A) as its columns. All of the lifts in
this formula are known, so the vector α can be estimated straightforwardly. The
elements of α are nonzero for those attributes that share a topic with B, and zero
for others. Also, the elements are more or less constant within attributes
of a given topic. Now $Q_{i,B} = \alpha_A c_i \sum_j Q_{j,B}$, where $\alpha_A$ can be averaged
over all A' belonging to topic i, $c_i$ is known, and for small and equal $s_j$ we
can approximate $P(B) \approx s \sum_j Q_{j,B}$, which gives us $\sum_j Q_{j,B}$. We can also
assume $\sum_j Q_{j,B} = 1$ and scale the estimated $Q_{i,B}$ accordingly.
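
The last step of the list above amounts to solving a small linear system for α. The sketch below does this with plain Gaussian elimination on a toy example with noise-free lifts; this solver is an assumption of the sketch, not necessarily how the estimation was carried out in the experiments, and with lifts estimated from data a least-squares fit would be the more natural choice. The matrix R, the value s = 0.1, and the weights are made up for the illustration.

```python
def solve_alpha(R, b):
    """Solve alpha^T R = b, i.e. R^T alpha = b, by Gaussian elimination
    with partial pivoting. R is assumed to be square and invertible."""
    n = len(b)
    # Augmented matrix of the transposed system.
    M = [[R[c][r] for c in range(n)] + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    alpha = [0.0] * n
    for r in reversed(range(n)):
        acc = M[r][n] - sum(M[r][c] * alpha[c] for c in range(r + 1, n))
        alpha[r] = acc / M[r][r]
    return alpha

# Two topics with s = 0.1 and two core attributes each: lifts are 1/s = 10
# within a topic, 1 across topics, and 0 on the diagonal by convention.
R = [[0, 10, 1, 1],
     [10, 0, 1, 1],
     [1, 1, 0, 10],
     [1, 1, 10, 0]]
# Lifts between an unclustered attribute B and the four core attributes.
lift_B = [3.4, 3.4, 2.6, 2.6]
alpha = solve_alpha(R, lift_B)
# Share of topic 0 in B: c_0 times the average alpha over its core attributes.
share_topic0 = 2 * (alpha[0] + alpha[1]) / 2
```

In this toy case α comes out as roughly (0.3, 0.3, 0.2, 0.2), so topic 0 accounts for about 60% of the weight of B.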

4 Experimental Results 

4.1 Generated Data 

We designed experiments to see how the conspiracy statistic θ of a model affects
our clustering results. The results corroborate our conjecture that low-conspiracy
models are easier to discover. We constructed random models with
θ-bounded conspiracy using the following recipe. The model has 10 topics and
100 attributes. The probability $s_i$ of a topic was drawn uniformly at random
from the interval [0.01, 0.5]. Each attribute was assigned a primary topic so that 
each topic was primary for 10 attributes. 

To assign the within-topic attribute probabilities $Q_{i,A}$ so that the conspiracy
parameter is θ, we first drew a number p uniformly from [0, 1] and let $Q_{i,A} = p$
for the primary topic i. Then we distributed the mass θp to the non-primary
topics in an uneven way. Each non-primary topic in random order received a
fraction φ of the remaining mass, where φ is chosen at random from [0, 1],
separately for each non-primary topic. The last topic received all remaining mass
to make the mass sum up exactly to θp.
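
The recipe can be sketched as follows. The function name and the way primary topics are assigned (attribute a gets topic a mod k, so that each topic is primary for n / k attributes) are assumptions of this illustration; the text only states that each topic was primary for 10 attributes.

```python
import random

def random_model(theta, k=10, n=100, rng=None):
    """Random k-topic model over n attributes with theta-bounded conspiracy.

    Assumes k >= 2 and 0 <= theta <= 1. Returns (s, Q).
    """
    rng = rng or random.Random()
    s = [rng.uniform(0.01, 0.5) for _ in range(k)]
    Q = [[0.0] * n for _ in range(k)]
    for a in range(n):
        primary = a % k          # hypothetical assignment of primary topics
        p = rng.random()
        Q[primary][a] = p
        others = [i for i in range(k) if i != primary]
        rng.shuffle(others)      # non-primary topics in random order
        remaining = theta * p    # mass to spread over non-primary topics
        for i in others[:-1]:
            share = rng.random() * remaining
            Q[i][a] = share
            remaining -= share
        Q[others[-1]][a] = remaining  # the last topic takes what is left
    return s, Q
```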

This way of generating a random model includes a number of somewhat 
arbitrary choices that we now justify. First, the topic probabilities $s_i$ were chosen
not from [0, 1] but from a smaller interval. Some lower limit is necessary so 
that each topic is represented in a finite data sample; and an upper limit is 
needed by our algorithm, which distinguishes a topic by estimating its probability 
and cannot discover a topic that is almost always active. In a preliminary test 
(not shown), our algorithm’s performance was best with low upper limits, and 
deteriorated rapidly when the upper limit approached 1. We chose 0.5 as the 
upper limit as a conservative approach: in document data, one would expect 
that individual topics have much smaller probabilities. 

Second, we discuss the distribution of the within-topic attribute probabilities 
of non-primary topics. A more obvious strategy would be to draw the probabili- 
ties independently and then to normalize, but then the distribution would have 
become more even. With 9 non-primary topics, all the probabilities would center
around θ/9 times the primary probability, which makes the task far easier:
none of the non-primary topics is likely to be confused with the primary topic. 
In contrast, our procedure typically results in a few non-primary topics with 
non-negligible topic-attribute probabilities for each attribute. We wish to mimic 
the behavior of true data sets, such as text document data: a term may have 
several meanings, perhaps a primary meaning and one or few secondary mean- 
ings, hence it belongs primarily to one topic of discussion and secondarily to a 
few other topics, but not to all possible topics. 

In the experiment, we estimated the topic-attribute probabilities Q using 
the lift statistic, NMF, PLSA^ and K-means. The NMF and PLSA methods 
estimate Q given the observed binary data. A naive alternative is the simple 
K-means algorithm which clusters the attributes into non-overlapping sets; we 
assume that $Q_{i,A}$ is equal for all attributes A of topic i and sums to 1 at each
topic. 

Figure 1 shows the mean squared errors (MSEs) of the estimated Q, compared
to the true probabilities used to generate the data. The conspiracy parameter
θ runs from 0 to 1. At each θ, the topic probabilities s are sampled anew, so
there is great variability in the data models. Originally, the topic-attribute
probabilities estimated by the methods do not necessarily sum to 1 at each topic -
they do in PLSA, but neither in the other methods nor in the true data model -
but we scale them accordingly, to be able to compare the MSEs.

In Figure 1 we see that at smaller θ, the Lift algorithm estimates the Q and
thus the structure of the data very nicely. When θ grows very large, the data

^ The PLSA method was kindly programmed by Mr. Teemu Hirsimaki. 
model is more difficult to estimate. The behaviors of NMF and PLSA^ do not
depend on θ, which is natural: the methods are not primarily aimed at such
θ-bounded data but instead are able to estimate the structure also when the
topics are totally overlapping. The K-means algorithm estimates the structure
of the data poorly for all θ.

4.2 Real Data 

We performed experiments on bibliographical data on computer science available 
on the WWW*. We first tested the model’s prediction that $\mathrm{lift}(A, B) \ge 1$ for
all A, B; while it does not hold perfectly because there are negative correlations 
between words, the vast majority of these negative correlations are statistically 
insignificant (details omitted). We preprocessed the data by removing a small 
set of stop words and all numbers, and then selected the 100 most frequent terms 
for further analysis. 

We computed the lift statistics between all term pairs and used hierarchical 
average linkage clustering based on the inverses of lifts. Table 1 shows how the 
terms are clustered into topics. The number of clusters (21) was chosen based 
on the distance between clusters being merged in the process of hierarchical 
clustering: until these 21 clusters, the intercluster distances were quite small but 
distances between the final 21 clusters were large. The structure in Table 1 is 
immediately familiar to a theoretical computer scientist: the topics concentrate 
on different fields of the science. 
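
The clustering step can be illustrated with a naive average-linkage implementation that uses inverse lifts as distances. This is a sketch on a made-up four-term lift matrix, not the code used for Table 1; it assumes all off-diagonal lifts are positive and stops at a fixed number of clusters instead of inspecting merge distances.

```python
def average_linkage(dist, num_clusters):
    """Naive agglomerative clustering with average linkage.

    dist: symmetric n x n matrix of pairwise distances.
    Returns a list of clusters, each a sorted list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(dist[a][b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        pairs = [(linkage(clusters[p], clusters[q]), p, q)
                 for p in range(len(clusters))
                 for q in range(p + 1, len(clusters))]
        _, p, q = min(pairs)
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters

# Toy lift matrix for four terms: {0, 1} and {2, 3} co-occur strongly.
lift = [[0, 8, 1, 1],
        [8, 0, 1, 1],
        [1, 1, 0, 9],
        [1, 1, 9, 0]]
dist = [[0.0 if a == b else 1.0 / lift[a][b] for b in range(4)]
        for a in range(4)]
clusters = average_linkage(dist, 2)
```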

We also performed topic finding on yeast gene expression data, using the same 
gene expression dataset as in [24] that combines the results of several different 
gene expression studies. The combined dataset measures the expression level of 
over six thousand genes in almost a hundred experiments; thus, we used the 
experiments as “attributes” and the genes as “measurements”. The levels were 
discretized so that the top 5% expressed genes in each experiment were given 
the value 1. The results are not shown due to space constraints, but as a brief 
example, the discovered topics were seen to reflect cyclical behavior of the genes 
in the time-series experiments. 
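
The discretization of one experiment can be sketched as below. The function name, the rounding of the cutoff, and the treatment of ties (broken by position) are assumptions of this illustration; the text only states that the top 5% expressed genes in each experiment were given the value 1.

```python
def discretize_top(levels, fraction=0.05):
    """0-1 vector marking the `fraction` of genes with the highest levels."""
    n_top = int(round(fraction * len(levels)))
    ranked = sorted(range(len(levels)), key=lambda g: levels[g], reverse=True)
    top = set(ranked[:n_top])
    return [1 if g in top else 0 for g in range(len(levels))]
```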

5 Concluding Remarks 

We studied a simple generative topic model and showed that the lift statistics of 
attributes can be described in matrix form. Based on this, we obtained a simple 
algorithm for finding topics in 0-1 data. We also showed that a problem related 
to the identification of topics is NP-hard, and gave experimental results. 

Several open problems remain. Our model is simple, and seems to yield good 
results; still, more complex models might do a better job at identifying, e.g., 
topics containing partly exclusive attributes. The identifiability of the model is 
another interesting issue: could one prove something about it? Further experi- 
mental studies are also needed. 

^ No simulated annealing was used in the EM algorithm of the PLSA.
* http://liinwww.ira.uka.de/bibliography/Theory/Seiferas/

Fig. 1. Mean squared errors of Q at different conspiracy parameters θ. Lift *,
NMF O, PLSA o, K-means •.

Table 1. Terms in different topics. (The order of the topics is not relevant.)

topic  terms
1      algorithms approximation damath problems scheduling some tree two
2      analysis distributed libtr probabilistic systems
3      bounds communication complexity focs lower
4      algorithm efficient fast ipl matching problem set simple
5      design ieeetc network networks optimal parallel routing sorting
6      note tcs
7      finding graphs minimum planar polynomial sets sicomp time
8      graph number properties random tr
9      from information learning lncs theory
10     approach jacm linear new programming system
11     actainf binary search trees
12     abstract computation extended model stoc
13     automata finite languages mfcs
14     data dynamic infctrl logic programs structures using
15     applications icalp theorem
16     cacm computer computing science
17     crypto functions
18     jcss machines
19     algebraic beatcs computational geometry
20     de stacs van
21     codes dmath

Appendix

Proof of Theorem 1. That the problem is in NP is simple to see: the certificate
is the topic vector t, and the formula for $P(x \mid t, T)$ involves multiplying
n numbers, each computable in $O(k)$ time.

To show NP-hardness, we reduce SAT to a topic assignment problem. Given
a SAT instance of m clauses over n variables, we define a topic model with
2n topics and n + m attributes. For each variable $V_i$, we create two topics $T_i$
and $T_i'$, and one attribute $A_i$. For each clause $C_j$, we create one attribute $B_j$.
Each topic has probability 0.5, and each attribute has 0/1 within-topic probabilities
as follows: attribute $A_i$ has probability 1 in topics $T_i$ and $T_i'$ and probability
0 in other topics; attribute $B_j$ has probability 1 in the topics $T_i$ such
that $V_i$ appears positively in clause $C_j$ and in the topics $T_i'$ such that $V_i$ appears
negatively in clause $C_j$, and probability 0 in all other topics. We consider a data
vector where all attributes have value 1.

Now, if the SAT problem has a satisfying truth assignment, it corresponds
to a solution of the topic assignment problem where $T_i$ is active if $V_i$ is true and
$T_i'$ is active if $V_i$ is false. This solution has likelihood $0.5^n$, since exactly n topics
are active, and the active topics explain all attributes $A_i$ and $B_j$. Conversely,
if a solution to the topic assignment problem exists such that the likelihood is
at least $0.5^n$, it must have at most n active topics. To explain attribute $A_i$,
either $T_i$ or $T_i'$ must be active; thus the number of active topics is exactly n,
and the solution corresponds to a truth assignment. Since the solution must
also explain each attribute $B_j$, the truth assignment must satisfy the original
problem. In summary, the SAT instance has a solution if and only if the topic
assignment problem has a solution with likelihood at least $0.5^n$. □
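
The construction in the proof is mechanical enough to write down directly. The sketch below builds the matrix Q from a CNF formula and checks whether a truth assignment, translated into a topic vector with exactly n active topics, explains the all-ones data vector. The clause encoding (signed 1-based integers), the topic and attribute numbering, and the function names are conventions chosen for this illustration; the sketch only checks whether every attribute is explained and does not compute the likelihood value used in the proof.

```python
def sat_to_topic_model(num_vars, clauses):
    """Build Q for the reduction. A clause is a list of nonzero ints:
    literal v means variable v appears positively, -v negatively (1-based).

    Topic 2*(v-1) plays the role of T_v, topic 2*(v-1)+1 that of T_v'.
    Attributes 0..n-1 are the A_i, attributes n..n+m-1 are the B_j.
    """
    n, m = num_vars, len(clauses)
    Q = [[0.0] * (n + m) for _ in range(2 * n)]
    for v in range(1, n + 1):
        Q[2 * (v - 1)][v - 1] = 1.0
        Q[2 * (v - 1) + 1][v - 1] = 1.0
    for j, clause in enumerate(clauses):
        for lit in clause:
            topic = 2 * (abs(lit) - 1) + (0 if lit > 0 else 1)
            Q[topic][n + j] = 1.0
    return Q

def explains_all(Q, t):
    """True if the all-ones data vector has probability 1 given t
    (with 0/1 entries in Q this probability is either 0 or 1)."""
    return all(any(t[i] and Q[i][a] == 1.0 for i in range(len(Q)))
               for a in range(len(Q[0])))

def assignment_to_topics(assignment):
    """Truth assignment (list of bools) -> topic vector with n active topics."""
    t = []
    for value in assignment:
        t += [1, 0] if value else [0, 1]
    return t

# (V1 or not V2) and (V2)
Q = sat_to_topic_model(2, [[1, -2], [2]])
```

For this formula, the assignment V1 = V2 = true yields a topic vector that explains every attribute, while V1 = false, V2 = true leaves the attribute of the first clause unexplained.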

Abstract. We present an inductive logic programming bottom-up learning algorithm
(BFOIL) for synthesizing logic programs for multi-slot information extraction
from hypertext documents. BFOIL learns from positive examples only
and uses a logical representation for hypertext documents based on the document
object model (DOM). We briefly discuss several BFOIL refinements and show
very promising results of our IE system LIPX in comparison to state-of-the-art IE
systems.



1 Introduction 

In the last decade several techniques and systems based on relational learning in the 
area of information extraction (IE) have been developed [10]. Though a handful of
approaches [1,2,5] exist which capture the idea of bottom-up and top-down rule learning
inspired by inductive logic programming (ILP) [12], it is surprising that almost no sys- 
tem [8] tries to follow a pure logical ILP based approach. ILP in general offers broad 
varieties to be adapted to different problem domains by simply changing the problem 
representation and/or the hypothesis language. Our aim is to develop an algorithm for 
learning multi-slot wrappers for hypertext documents, based on logic programming and 
ILP concepts. This technology can easily be extended with additional information on 
the representational level (document pre-processing and hypothesis language) and al- 
gorithmic level (semantic least general generalization operators). 

In Sections 2 and 3 we introduce a DOM [4] based representation for hypertext documents
and a relational representation of text examples. Section 4 briefly explains the
hypothesis language and derived example descriptions used for later bottom-up learning.
The Bottom-up First Order Inductive Learning algorithm and results are presented
in Sections 5 and 6.



2 Document Representation 

Throughout this paper we will focus on HTML documents. It should be noted that the 
approach presented in this paper is easily adaptable to XML or similar tag-based lan- 
guages. In order to capture and model the syntactical and hierarchical aspects of HTML 
and XML documents we define the concept of TDOM-trees, which is strongly related 
to that of a document object model (DOM-tree). A node in a TDOM-tree consists of 
four features: a document reference ($D_{id}$), a node identifier ($n_{id}$), the corresponding
token t describing the document text denoted by the node, and an ordered list of child
node identifiers ($[ch_1, \ldots]$). Thus we represent a node in a TDOM-tree as a term
$node(D_{id}, n_{id}, t, [ch_1, ch_2, \ldots])$. The basic intention of tokens is, like in most other
approaches, to group symbols from the text separated by white spaces or other separators
into typed words like integer, date, html-tags etc. Each token is represented as a term
with a list of feature-value pairs, which is given by $token([f_1, v_1], \ldots, [f_n, v_n])$, where $f_i$
is an arbitrary feature name and $v_i$ is an arbitrary feature value with $i = 1, \ldots, n$. For
example, Tok(<img src="a.jpg">) = {token([(ttype, html), (value, '<img src="a.jpg">'),
(spos, 0), (epos, 16), (tag, img), (src, 'a.jpg')])}.

Node identifiers are terms representing a path from the root node to a node in the
TDOM. To illustrate the idea of node identifiers assume every node in a tree is
assigned a unique number. The function $child : \mathbb{N}_0 \times \mathbb{N}_0 \to \mathbb{N}_0$ computes for a given
node number $i$ and $n \in \mathbb{N}_0$ the $n$-th unique child node of $i$. For example the term
child(child(child(root,1),0),3) refers to the fourth child of the first child of the
second child of the root node in the TDOM. For better readability and later handling we
use a Prolog list notation [1,0,3], leaving out the root node, to denote node identifiers.
Hence a node identifier is used to assign a unique term to each node in a TDOM. It also
provides information about the position in the TDOM-tree. In fact, the notation of node
identifiers is strongly related to the Dewey notation [18]. A leaf node in a DOM-tree
represents text appearing at the "surface" of the hypertext document. For example a
whole paragraph may be associated with one leaf node in a DOM-tree. In many cases,
this representation is not accurate enough for IE tasks. We modify the concept of a
DOM-tree such that a leaf node in a DOM-tree becomes many leaf nodes in a TDOM-tree.
Each of these nodes represents one token from the text.

Given this notation, an arbitrary HTML document $D$ can be represented as a set of
ground unit clauses describing a TDOM model of $D$. $\mathcal{T}(D_i)$ denotes the TDOM of $D$
with $D_{id} = i$. A $\mathcal{T}(D_0)$ representation for an example HTML page is shown in Figure
1. To be able to compare node identifiers we define the following order relation. A node
identifier $n_i$ is smaller than a node identifier $n_j$, written $n_i < n_j$, iff
$\exists x \in \mathbb{N}_0 : n_j.x > n_i.x \wedge \forall y \in \mathbb{N}_0 : y < x$ it holds that $n_j.y = n_i.y$, where $n.x$ denotes
the $x$-th child number (starting from the left) of a node identifier $n$. Two node
identifiers $n_i$ and $n_j$ are equal if they have the same length and
$n_i \not< n_j \wedge n_j \not< n_i$. For example: $[0,0,3] < [0,2]$.
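The order relation can be sketched as a comparison at the first position where two identifiers differ. The function names below are hypothetical; the sketch assumes identifiers are plain lists of child numbers and treats an identifier and its proper prefix as incomparable, as the definition above does.

```python
def node_less(n_i, n_j):
    """True iff n_i < n_j: at the first differing position x (all earlier
    positions being equal) the child number of n_j is the larger one."""
    for a, b in zip(n_i, n_j):
        if a != b:
            return b > a
    # No differing position inside the common length: one identifier is a
    # prefix of the other (or they are identical), so neither is smaller.
    return False


def node_equal(n_i, n_j):
    """Equal iff same length and neither identifier is smaller than the other."""
    return (len(n_i) == len(n_j)
            and not node_less(n_i, n_j)
            and not node_less(n_j, n_i))


print(node_less([0, 0, 3], [0, 2]))  # the example from the text
```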

Node identifiers have nice properties for wrapper learning. Similar to expressions
in the XPath language [19], node identifier expressions can be used to refer to more
than one node by the use of variables. The node identifier [0,1,1,X] refers to every child
node of the <ul> environment of Figure 1. As another example, the term [X,3] refers to
the fourth child node of every child of the root node that has at least four child nodes.
It is important to point out that a variable can only be substituted by one value and not
by a partial node identifier expression like [0,1]. Furthermore, additional constraints
can be introduced by using one variable more than once (e.g. [0,X,2,X,0]) or more than
one variable (e.g. [Y,X,2,X,Y]). Pattern variables with the same name are then not
treated disjunctively and thus have to be instantiated with the same value. In fact, in
the XPath query language such expressions can only be expressed by means of iterative
programming language constructs like for-loops and thus are not as elegant, compact
and easy to handle.
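The matching behavior of node identifier expressions can be sketched as follows. The encoding is an assumption made for illustration only: variables are written as Python strings, constants as integers, and `match` is a hypothetical helper rather than part of the described system.

```python
def match(pattern, node_id):
    """Return the variable bindings if node_id is an instance of pattern,
    otherwise None. Strings in the pattern are variables; each variable
    stands for exactly one child number, and repeated variables must be
    bound to the same value."""
    if len(pattern) != len(node_id):
        return None  # a variable never absorbs a partial identifier like [0,1]
    bindings = {}
    for p, value in zip(pattern, node_id):
        if isinstance(p, str):
            if bindings.setdefault(p, value) != value:
                return None  # same variable, different values
        elif p != value:
            return None
    return bindings


print(match([0, 1, 1, "X"], [0, 1, 1, 0]))
print(match([0, "X", 2, "X", 0], [0, 4, 2, 4, 0]))
print(match([0, "X", 2, "X", 0], [0, 4, 2, 5, 0]))
```

In a Prolog setting this check is simply term unification; the explicit dictionary only makes the shared-variable constraint visible.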
HTML document:

    <html><head><title>Example</title></head>
    <body><h1>Example</h1><ul><li> A simple <b>example</b>
    <li>of a TDOM-tree.</ul></body></html>

[Diagram of the simplified TDOM-tree with node identifiers not reproduced.]

TDOM node:

    node(0, [0,1,1,0],
         token([(ttype,html),(value,'<li>'),(tag,li),(spos,'116'),(epos,'119')]),
         [[0,1,1,0,0],[0,1,1,0,1],[0,1,1,0,2]]).

Fig. 1. HTML document, simplified TDOM-tree, TDOM node and span ([0,1,1,0],1,2)

This notation makes it easy to generalize on node identifiers by means of lgg operations
[15]. Assume one text example is located in one document in node [0,1,1,0,0] and
in the other document in node [0,1,1,1,0]. A reasonable first step in learning an
extraction rule is the assumption that all nodes described by the generalized node
identifier [0,1,1,X,0] are good extractions.
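The generalization step on node identifiers can be sketched as a position-wise anti-unification. This is a minimal illustration under two assumptions that are not taken from the paper: identifiers of different length are treated as having no lgg here, and generated variables are named X0, X1, and so on.

```python
def lgg_node_ids(a, b):
    """Least general generalization of two node identifiers of equal length.
    Equal components are kept; differing components are replaced by a
    variable, and the same pair of values always maps to the same variable."""
    if len(a) != len(b):
        return None  # assumption: no generalization across different depths
    variables = {}
    result = []
    for x, y in zip(a, b):
        if x == y:
            result.append(x)
        else:
            if (x, y) not in variables:
                variables[(x, y)] = f"X{len(variables)}"
            result.append(variables[(x, y)])
    return result


print(lgg_node_ids([0, 1, 1, 0, 0], [0, 1, 1, 1, 0]))
```

Reusing one variable for a repeated pair of values is what lets a generalization such as [0,X,2,X,0] keep the constraint that both positions agree.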

3 Example Representation 

One essential concept of our approach is that of a span. Informally speaking, a span
determines a subtree in a TDOM-tree. We pick up the idea mentioned by [3], where a span
is defined as a triple consisting of a node identifier N and a left and right delimiter L, R.
Delimiters determine the left and right boundaries of an interval of child nodes
contained in a span. For example the span ([0,1,1,0],1,2) of the example TDOM (Figure
1) refers to the set of node identifiers {[0,1,1,0,1], [0,1,1,0,2], [0,1,1,0,2,0]}. More
precisely: a span S = (N,L,R) is the set of all reachable descendant nodes starting at the
i-th child node of node N with i = L..R. In general we assume a depth first traversal to
enumerate all nodes of a span, to ensure the left-to-right order of the text at the surface
of a document.

A minimal example span MS for a given text T is the span with the least cardinality
including the text T. For example let T be a text fragment from the document (Figure
1) like "simple example", let $S_1$ be the span ([0,1,1],0,1) and let $S_2$ be the span from
our previous example. Clearly both $S_1$ and $S_2$ contain T, but $card(S_1) > card(S_2)$, and
therefore $S_2$ is the only existing minimal example span of T with respect to the example
TDOM, because $\nexists S' : card(S') < card(S_2)$, where $S'$ is a span including T.
For the rest of the paper we focus on multi-slot extraction tasks, where a text example
$t$ with $n$ slots consists of a tuple of texts $\langle t_1, \ldots, t_n \rangle$ taken from a document $D$. The
initial example set of text tuples is denoted by $E_j$. Given $D$ and $t$ we define the example
representation of $t$ with respect to $D$ as $e_t^D = \langle \langle s_1, \ldots, s_n \rangle, \langle t_1, \ldots, t_n \rangle \rangle$, where $s_i$ is
the minimal example span of $t_i$ with $i = 1..n$ in $\mathcal{T}(D)$. For later purposes we define
the notion of a validation set given by

$VS(E_j, p) = \{ p(D_{id}, [s_1, \ldots, s_n], [t_1, \ldots, t_n]) \mid t \in E_j \wedge e_t^D = \langle \langle s_1, \ldots, s_n \rangle, \langle t_1, \ldots, t_n \rangle \rangle \}$

Further we make some assumptions about the presentation of examples: 1) $t_1$
to $t_n$ do not define a particular order of occurrences of the $t_i$ in $D$ (i.e. we cannot
conclude that $t_i$ occurs before $t_{i+1}$ in $D$). 2) Each $t_i$ is associated with an intended
semantics (e.g. $t_i$ describes the ZIP code). 3) Missing slot fillers in the text (e.g. no ZIP
field or placeholder stated in the text) or empty slot fillers (e.g. there is a ZIP
placeholder but no code is given) are represented by the empty string "".



4 Hypothesis Language 

This section will cover three questions: given an example representation, which
important relational properties can be observed (Section 4.1)? How can these observations
be represented? How are these representations used to define a hypothesis language for
inductive learning of extraction rules (Section 4.2)?



4.1 Observing Example Properties 

We write s.n, s.l and s.r to refer to the components of a span s := (n,l,r). Given an
example representation $e_t^D$ we investigate each tuple argument $t_i$ and its span $s_i$
according to the following four levels. Note that the following predicates can be exchanged
for arbitrary other ones describing relational information regarding the training examples.

Structural Level: the position of a span $s_i$ and its neighbor nodes are investigated:

xpath($D_{id}$, s, tl) holds if $D_{id}$ is a document id (Section 2), s is a span and tl is the list of
tokens associated with each node following the path from the root node to the node of s.

xspan($D_{id}$, s, tl) holds if tl is the associated list of tokens of all nodes of span s.

xright_brother($D_{id}$, n, tr) holds if n is a node identifier and tr is the associated token of
the right neighbor node of n. Analogously we define a left brother predicate.

Textual or Content Level: a relation between the example text, its tokens associated
with the leaf nodes and its span is defined:

span_text_and_tokens($D_{id}$, s, t, tl) holds if tl is the list of tokens associated with all leaf
nodes of span s for text t.

Delimiter Level: predicates are defined that incorporate a widespread idea of IE
approaches, namely to learn right and left delimiters of relevant text parts:

start_end_nodes($D_{id}$, s, $n_l$, $n_r$) holds if $n_l$ is the start node and $n_r$ the end node of text t
in $\mathcal{T}(D)$ referred to by $D_{id}$.

xpredecessor($D_{id}$, n, $n_i$, tl) holds if the token list tl contains the tokens associated with
the first n nodes we meet going backwards in a depth first search* from $n_i$. Analogously
we define xsuccessor to collect all n successor tokens we meet by a depth first traversal
after having met $n_i$. We call n the context distance.



extract(D, [[0,1,0,9,X,5] : 0 : R], [[E1|ER]]) :-
    xpath(D, [0,1,0,9,X,5],
        [token([(ttype,html), (value,'<html>'), (tag,html), (spos,'0'), (epos,'5')]),
         token([(ttype,html), (value,'<center>'), (tag,center), (spos,'136'), (epos,'143')]),
         token([(ttype,html), (value,'<nobr>'), (tag,nobr), (spos,'146'), (epos,'151')]),
         token([(ttype,html), (value,'<table border cellpadding=2>'), (tag,table), (border,''),
                (cellpadding,'2'), (spos,'1090'), (epos,'1117')]),
         token([(ttype,html), (value,'<tr>'), (tag,tr), (spos,V19), (epos,V20)])]),
    unify(C1, 0), unify(C2, R), member(C1, [0]), member(C2, [0, 1, 2]),
    xspan(D, [0,1,0,9,X,5] : 0 : R,
        [token([(ttype,html), (value,V1), (tag,td), (align,V2), (spos,V3), (epos,V4)]), T1|TR1]),
    xleft_brother(D, [0,1,0,9,X,5],
        [token([(ttype,html), (value,'<td align=right>'), (tag,td), (align,right), (spos,V5), (epos,V6)])]),
    xright_brother(D, [0,1,0,9,X,5],
        [token([(ttype,html), (value,V7), (tag,td), (align,V8), (spos,V9), (epos,V10)])]),
    span_text_and_tokens(D, [0,1,0,9,X,5] : 0 : R, [E1|ER], [T1|TR1]),
    start_end_nodes(D, [0,1,0,9,X,5] : 0 : R, [0,1,0,9,X,5,0], [0,1,0,9,X,5,R]),
    xpredecessor(D, 7, [0,1,0,9,X,5,0],
        [token([(ttype,html), (value,V1), (tag,td), (align,V2), (spos,V3), (epos,V4)]),
         token([(ttype,html), (value,'<td align=right>'), (tag,td), (align,right), (spos,V5), (epos,V6)]),
         token([(ttype,html), (value,'<td align=right>'), (tag,td), (align,right), (spos,V11), (epos,V12)]),
         token([(ttype,html), (value,'<td align=right>'), (tag,td), (align,right), (spos,V13), (epos,V14)]),
         token([(ttype,html), (value,'<td align=left>'), (tag,td), (align,left), (spos,V15), (epos,V16)]),
         token([(ttype,html), (value,'<td>'), (tag,td), (spos,V17), (epos,V18)]),
         token([(ttype,html), (value,'<tr>'), (tag,tr), (spos,V19), (epos,V20)])]),
    xsuccessor(D, 7, [0,1,0,9,X,5,R],
        [T0, token([(ttype,V21), (value,V22), (spos,V23), (epos,V24)]), T2, T3, T4, T5, T6]),
    xsmallest_common_span(D, [[0,1,0,9,X,5] : 0 : R], [0,1,0,9,X,5] : 0 : R,
        [token([(ttype,html), (value,V1), (tag,td), (align,V2), (spos,V3), (epos,V4)])]).

Fig. 2. Learned single-slot rule for QS-vol



Relational Span Level: to figure out relations between spans we define:

xsamespan_node($D_{id}$, $s_i$, $s_j$) holds if $n_i$ and $n_j$ of spans $s_i = (n_i, l_i, r_i)$ and
$s_j = (n_j, l_j, r_j)$ are unifiable.

xnode_less($D_{id}$, $n_i$, $n_j$, dist) holds if $n_i < n_j$, where dist is the list of differences
between the components of $n_j$ and $n_i$ (e.g. xnode_less(0, [1,4,0], [2,3,0,2], [1,-1,0])).
Analogously we define xnode_greater.

overlapping_span($D_{id}$, $s_i$, $tl_i$, $s_j$, $tl_j$) holds if $(s_i.l < s_j.l) \wedge (s_i.r > s_j.l) \wedge (s_i.r < s_j.r)$,
where $tl_i$ and $tl_j$ are the corresponding token lists of $s_i$ and $s_j$.

span_in_span($D_{id}$, $s_i$, $s_j$) holds if span $s_i$ is a subtree of span $s_j$.

xsub_related_span($D_{id}$, $s_i$, $s_j$) holds if $s_j.n$ is a prefix of $s_i.n$ (e.g. [1,2] is a prefix of
[1,2,3]).

xsmallest_common_span($D_{id}$, $[s_1, \ldots, s_n]$, $s_x$, $tk_x$) holds if $s_x$ is the smallest span (wrt.
its number of nodes) in $\mathcal{T}(D)$ such that each span $s_i$ with $i = 1..n$ is a subtree of $s_x$
and $tk_x$ is the token associated with $s_x.n$.

* This captures the idea to interpret the document as a sequence of tokens rather than a tree, and
we investigate the n preceding tokens of the token associated with $n_i$.
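Three of the relational span level tests reduce to simple list and integer comparisons, which the following sketch illustrates. The function names are hypothetical stand-ins for the computational core of xnode_less (its dist argument), xsub_related_span and overlapping_span; document ids and token lists are omitted, and spans are assumed to be triples (node identifier, left, right).

```python
def node_dist(n_i, n_j):
    """Component-wise differences n_j - n_i over the common length."""
    return [b - a for a, b in zip(n_i, n_j)]


def is_prefix(prefix, node_id):
    """True iff prefix is an initial segment of node_id."""
    return len(prefix) <= len(node_id) and node_id[:len(prefix)] == prefix


def overlapping(s_i, s_j):
    """Delimiter condition of overlapping_span: s_i starts before s_j and
    ends inside the child interval of s_j."""
    (_, l_i, r_i), (_, l_j, r_j) = s_i, s_j
    return l_i < l_j and r_i > l_j and r_i < r_j


print(node_dist([1, 4, 0], [2, 3, 0, 2]))  # the dist example from the text
print(is_prefix([1, 2], [1, 2, 3]))
print(overlapping(([0, 1], 1, 4), ([0, 1], 3, 6)))
```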



4.2 Clause Descriptions of Examples 

Now that we have defined predicates for the description of text example properties
based on the representation of a TDOM, we introduce the concept of a clause
description $CD(e_t^D)$ for an example representation $e_t^D$. In terms of extensional and
intensional object languages a clause description of an example is an intensional object
description. Furthermore we use the same language for the description of objects and
hypotheses.

Let $L_H$ be the set of predicates introduced in Section 4.1. We call this the hypothesis
language, which is used later for the construction of rules. This is in analogy to standard
ILP algorithms like FOIL [16]. It should be noted that the hypothesis language can be
freely chosen. Furthermore let us assume that a logic program $P_{L_H}$ is given that
implements the intended semantics of the predicates in $L_H$. To denote the union of $P_{L_H}$ and
$\mathcal{T}(D_i)$ we write $P_{L_H}^{D}$. Now we can define

$CD(e_t^D) = \{ l'\sigma \mid P_{L_H}^{D} \vdash l'\sigma \}$

with $l \in L_H$, where $l'$ is $l$ instantiated according to its given semantics with
$s_i, t_i \in e_t^D$ and $\sigma$ is the calculated answer substitution. Here $\vdash$ denotes the logical
derivation operator and we assume a standard logical calculus (e.g. SLD-resolution [11]).

Finally we define $E^+$ to be the set of clause descriptions for a given set of examples
as $E^+ = \bigcup_{t \in E_j} CD(e_t^D)$. Additionally we extend every $CD(e_t^D)$ with a special predicate,
the rule head, defined as $extract(D_{id}, [s_1, \ldots, s_n], [t_1, \ldots, t_n])$, where every $s_i$ and $t_i$ is
instantiated with the associated argument from $e_t^D$. Then $CD(e_t^D)$ forms a ground
instantiated rule of the form $extract(D_{id}, [s_1, \ldots, s_n], [t_1, \ldots, t_n]) \leftarrow l_1, \ldots, l_n$ with $l_i \in L_H$.
Since we focus only on learning non-recursive Horn clauses, we do not have to use negation
operators for the body literals and consider a marked predicate (e.g. extract) to build
the head and all other literals in $CD(e_t^D)$ to form the body of a rule. Thus every $CD(e_t^D)$
is one rule describing exactly one text example with respect to $D$ and $L_H$. Accordingly,
computing all answers to the query $\vdash extract(D_{id}, [s_1, \ldots, s_n], [t_1, \ldots, t_n])$ provides
the validation set $VS(E_j, extract)$ (Section 3).

5 BFOIL Algorithm 

The central idea of BFOIL is to learn a set of rules in a bottom-up fashion from positive
examples only, by means of least general generalization (lgg) techniques [15]. In contrast
to the standard top-down learning approaches, which start with the most general
hypothesis, BFOIL starts with a set of ground rules (clause descriptions) as its initial
hypothesis and tries to generalize these clause sets by means of lgg operations. The term
clause-lgg denotes the lgg of two clauses $C_1$ and $C_2$, defined as
$clause\_lgg(C_1, C_2) = \{ lgg(l, m) \mid l \in C_1 \wedge m \in C_2 \wedge lgg(l, m) \text{ is defined} \}$. In general the
clause-lgg of two clauses has to be reduced, in the sense that redundant literals under
$\theta$-subsumption have to be removed.
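The lgg of terms and the clause-lgg can be sketched with the usual anti-unification scheme. The encoding is an assumption for illustration: compound terms and literals are tuples whose first element is the functor, constants are strings or integers, and `Var` is a hypothetical class for generated variables. The sketch does not perform the reduction under theta-subsumption mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable introduced by generalization."""
    name: str


def lgg(s, t, table):
    """Least general generalization of two terms. The table maps each pair
    of differing subterms to one shared variable."""
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s) == len(t)):
        return (s[0],) + tuple(lgg(a, b, table) for a, b in zip(s[1:], t[1:]))
    if (s, t) not in table:
        table[(s, t)] = Var(f"V{len(table)}")
    return table[(s, t)]


def clause_lgg(c1, c2):
    """Set of lggs of all literal pairs for which the lgg is defined,
    i.e. pairs with the same predicate symbol and arity."""
    table = {}
    result = set()
    for l in c1:
        for m in c2:
            if l[0] == m[0] and len(l) == len(m):
                result.add(lgg(l, m, table))
    return result


a = ("xspan", 0, ("span", ("id", 1, 2, 3), 3, 6))
b = ("xspan", 0, ("span", ("id", 1, 2, 3), 1, 10))
print(lgg(a, b, {}))
```

Sharing one table across all literal pairs of a clause keeps variables consistent between literals, which is what makes the result a generalization of the whole clause rather than of isolated literals.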
Algorithm 5.1 Basic BFOIL algorithm
Require: P = logic program; E_j = positive examples
 1: LearnedRules ← ∅
 2: while E+ ≠ ∅ do
 3:     Rule ∈ E+
 4:     E+ ← E+ \ {Rule}
 5:     ProblemSet ← ∅
 6:     while E+ ≠ ∅ do
 7:         X ∈ E+
 8:         R ← clause_lgg(Rule, X)
 9:         if apply(R, P, E_j).fp > 0 then
10:             ProblemSet ← ProblemSet ∪ {X}
11:         else
12:             Rule ← R
13:         E+ ← E+ \ {X}
14:     LearnedRules ← LearnedRules ∪ {Rule}
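The control flow of the basic algorithm can be sketched as below. This is not the authors' implementation: the generalization operator and the false positive test are passed in as hypothetical callables, the first remaining clause is taken wherever the listing picks an arbitrary element, and the sketch assumes that the problem set becomes the new example set after each inner loop (the printed listing does not show this step, but the partitioning described in the text suggests it).

```python
def bfoil(examples, clause_lgg, false_positives):
    """Sketch of the basic bottom-up loop.

    examples        -- list of clause descriptions (the set E+)
    clause_lgg      -- callable (rule, clause) -> generalized rule
    false_positives -- callable (rule) -> number of false positive extractions
    """
    remaining = list(examples)
    learned_rules = []
    while remaining:
        rule = remaining.pop(0)
        problem_set = []
        while remaining:
            x = remaining.pop(0)
            candidate = clause_lgg(rule, x)
            if false_positives(candidate) > 0:
                problem_set.append(x)   # x does not fit the current rule
            else:
                rule = candidate        # accept the generalization
        learned_rules.append(rule)
        # Assumption: clauses that could not be merged are tried again
        # as seeds for further rules.
        remaining = problem_set
    return learned_rules


# Toy stand-ins: clauses are sets of features, generalization is set
# intersection, and an empty rule counts as over-general.
toy_examples = [frozenset({1, 2}), frozenset({1, 3}),
                frozenset({4, 5}), frozenset({4, 6})]
rules = bfoil(toy_examples,
              clause_lgg=lambda r, x: r & x,
              false_positives=lambda r: 1 if not r else 0)
print(rules)
```

With the toy stand-ins the four clauses fall into two groups, each generalized to one rule, which mirrors the partitioning of the example set into subsets whose clause-lgg produces no false positives.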



Since calculating the clause-lgg of all of $E^+$ results, with high probability, in one rule
that over-generalizes, BFOIL inductively tries to partition $E^+$ into sets of
clauses $C_i \subseteq E^+$ such that the clause-lgg of each $C_i$ forms a new rule that does not
produce any false positive predictions (extractions). Since we only learn from positive
examples, standard techniques to determine false predictions during the learning phase
(validation on negative example sets) are not applicable. To obtain good rules nonetheless,
it is essential to estimate the correctness of rules during learning. Thus we assume that
the set $E_j$ is exhaustively enumerated. This means every intended extraction from $D$
is contained in $E_j$. Then we can conclude that if a rule extracts a tuple $t$ from $D$ with
$t \notin E_j$, it is a false positive. This introduces a view, similar to the closed world
assumption [17], on extraction examples and the absence of negative training data.

Function 5.2 apply(R,P,V) with false positive calculation
Require: R = rule; P = logic program; V = examples
 1: A ← { R_head σ | P ∪ R ⊢ R_head σ with σ answer subst. }
 2: fp ← | A \ (VS(V, R_head) ∩ A) |
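The false positive count under this closed-world view is a set difference between the answers of a rule and the validation set. The sketch below assumes both are available as collections of hashable tuples; the function name and the sample tuples are hypothetical.

```python
def false_positive_count(answers, validation_set):
    """|A \\ (VS ∩ A)|: answers of the rule that are not intended extractions.
    Relies on the assumption that the validation set is exhaustive."""
    answers = set(answers)
    return len(answers - (set(validation_set) & answers))


# Hypothetical two-slot extractions (name, ZIP code) from one document.
validation = {("doc0", "Smith", "90210"), ("doc0", "Jones", "10001")}
answers = {("doc0", "Smith", "90210"), ("doc0", "Menu", "Home")}
print(false_positive_count(answers, validation))
```

If the example set were not exhaustive, correct but unlabeled extractions would be counted as false positives, which is why the exhaustive enumeration assumption matters.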

This seems to be a very strong restriction which requires tedious labeling. But since
our approach does not need many examples (5-30 training examples; see Section 6, Figure
3), only a small number of documents have to be labeled.

In general an IE learning task has to deal with multiple documents $D_1, \ldots, D_n$ and
examples drawn from $D_1, \ldots, D_n$; we then define $E_j = \bigcup_{i=1}^{n} E_j^{D_i}$. Additionally we assume
that the logic program P is an implementation of $L_H \cup (\bigcup_{i=1}^{n} \mathcal{T}(D_i))$. Algorithm 5.1
shows the basic BFOIL algorithm and Function 5.2 the function apply for calculating
false positives. In the best case basic BFOIL returns one rule, the clause-lgg of $E^+$.
Experiments showed that this happens if examples are identical wrt. their structural
properties in a TDOM. In the worst case basic BFOIL just memorizes each clause in
$E^+$. This might happen if examples are too different wrt. the expressiveness of $L_H$
and the clause-lgg leads to over-generalized rules.



5.1 BFOIL Refinements 

The results of applying basic BFOIL to multi-slot extraction are not satisfying. Imagine
a clause $C_1 = CD(e_t^D)$ with {xpath(0,[1,2],[...]), ..., xpath(0,[1,3,1],[...]), ...}. The
intention of this clause is that the first literal describes path features of the first argument
and the second literal describes path features of the second argument of an example.



Algorithm 5.3 Consistent BFOIL algorithm
Require: P = logic program; E_j = positive examples
 1: LearnedRules ← ∅
 2: while E+ ≠ ∅ do
 3:     Rule ∈ E+
 4:     E+ ← E+ \ {Rule}
 5:     ProblemSet ← ∅
 6:     C ← ∅
 7:     while E+ ≠ ∅ do
 8:         X ∈ E+
 9:         R ← clause_lgg(Rule, X)
10:         if (apply(R, P, E_j, E+, C).fp > 0)
                or (not apply(R, P, E_j, E+, C).consistent) then
11:             ProblemSet ← ProblemSet ∪ {X}
12:         else
13:             Rule ← R
14:             C ← C ∪ {X}
15:         E+ ← E+ \ {X}
16:     LearnedRules ← LearnedRules ∪ {Rule}



Calculating the clause-lgg of $C_1$ and $C_2$ generalizes each xpath literal in $C_1$ with each
xpath literal in $C_2$. This is not what we want. Only the lgg of xpath literals describing
the same argument $i$ should be calculated from both clauses. With a simple syntactic
transformation before the calculation of an lgg and re-transformation before evaluation
of a generalized clause (rule) we can still use the standard lgg operation for learning.
Adding a prefix $arg_i$ to every predicate symbol of each literal in $E^+$ prevents the lgg
from generalizing over non-intended literals. This prefix protection is more an issue of
representation than a refinement of the BFOIL algorithm.
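The effect of the prefix protection can be sketched by renaming predicate symbols before pairing literals. The helper names and the tuple encoding of literals (predicate symbol first, then arguments) are assumptions made for this illustration.

```python
def protect(literals_by_arg):
    """Prefix each predicate symbol with the index of the example argument
    (slot) it describes. Input: iterable of (arg_index, literal)."""
    return {(f"arg{i}_{lit[0]}",) + tuple(lit[1:]) for i, lit in literals_by_arg}


def unprotect(clause):
    """Strip the argN_ prefix again before a rule is evaluated."""
    return {(lit[0].split("_", 1)[1],) + tuple(lit[1:]) for lit in clause}


def compatible_pairs(c1, c2):
    """Literal pairs for which an lgg would be computed
    (same predicate symbol and arity)."""
    return [(l, m) for l in c1 for m in c2 if l[0] == m[0] and len(l) == len(m)]


raw1 = [(1, ("xpath", 0, (1, 2))), (2, ("xpath", 0, (1, 3, 1)))]
raw2 = [(1, ("xpath", 1, (1, 2))), (2, ("xpath", 1, (1, 4, 1)))]

plain1, plain2 = {lit for _, lit in raw1}, {lit for _, lit in raw2}
print(len(compatible_pairs(plain1, plain2)))                  # all xpath literals pair up
print(len(compatible_pairs(protect(raw1), protect(raw2))))    # only same-slot pairs remain
```

Because the renaming is purely syntactic, a standard lgg operator can be used unchanged between the two transformations.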

The basic BFOIL algorithm is not consistent (i.e. learned rules may not cover
examples from $E^+$). Imagine two examples $e_1$ and $e_2$. The second argument of $e_1$ is
empty. Due to P and $L_H$ the clause description for $e_1$ would not contain literals for the
description of argument 2, to reduce the complexity associated with empty substitutions.
Because of the absence of these literals the clause-lgg eliminates the literals for
argument 2 from clause two. It is possible that the new rule still covers $e_1$ and does not
produce any false positives, but does not cover $e_2$ anymore. For this reason, we keep
track of examples that have been used successfully for learning the current rule (line 14
of Algorithm 5.3). Every rule refinement (line 9) must cover all examples that have been
successfully used in previous learning steps (line 10). Function 5.4 implements this test.



Function 5.4 apply(R,P,V,L,C) with consistency check
Require: R = rule; P = logic program;
         L ⊆ V = examples; C = example descriptions
 1: A ← { R_head σ | P ∪ R ⊢ R_head σ with σ answer subst. }
 2: fp ← | A \ (VS(V, R_head) ∩ A) |
 3: consistent ← true
 4: while C ≠ ∅ ∧ consistent do
 5:     c_e ∈ C
 6:     e ∈ L ∧ e is described by c_e
 7:     if e ∉ A then
 8:         consistent ← false
 9:     else
10:         C ← C \ {c_e}
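The consistency test can be sketched on top of the false positive count. The sketch assumes the answers of the candidate rule and the examples already used for the current rule are given directly as collections of tuples; the lookup from a clause description to its example, and the derivation of answers, are outside its scope, and all names are hypothetical.

```python
def apply_with_consistency(answers, validation_set, used_examples):
    """Return (fp, consistent) for a candidate rule.

    answers        -- extractions produced by the candidate rule
    validation_set -- all intended extractions (assumed exhaustive)
    used_examples  -- examples already generalized into the current rule
    """
    answers = set(answers)
    fp = len(answers - (set(validation_set) & answers))
    # The candidate is consistent only if it still covers every example
    # that contributed to the current rule.
    consistent = all(e in answers for e in used_examples)
    return fp, consistent


validation = {("d0", "Smith", "90210"), ("d0", "Jones", "")}
used = [("d0", "Jones", "")]
print(apply_with_consistency({("d0", "Smith", "90210")}, validation, used))
print(apply_with_consistency(validation, validation, used))
```

The first call shows the situation described in the text: no false positives, but an earlier example with an empty slot is no longer covered, so the refinement is rejected.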



A third refinement of BFOIL is the modification of the clause-lgg operator. To this
end we introduce the concept of a semantic lgg operator. Semantic lgg operators are
closely related to the chosen hypothesis language and example representation in
general. The key idea is to guide the lgg operation by additional knowledge to prevent
over-generalization. For example the lgg of spans and the generalization of xspan
literals tend to blow up the search space. The lgg of xspan(0,([1,2,3],3,6),[...]) and
xspan(0,([1,2,3],1,10),[...]) is xspan(0,([1,2,3],X,Y),[...]), which is obviously too
general from a practical point of view. For this reason we define additional semantic
lgg operators. These operators provide semantically based generalization by adding
special literals to the lgg of two clauses. We denote a semantic lgg operator similar to
an inference rule:

$$\frac{C_1 \setminus \{xspan(D_1,(N_1,L_1,R_1),TL_1)\} \qquad C_2 \setminus \{xspan(D_2,(N_2,L_2,R_2),TL_2)\}}{\{member(L,[L_1,\ldots,L_2]),\; member(R,[R_1,\ldots,R_2])\} \cup CL}$$

with $CL = clause\_lgg(C_1, C_2)$ and $xspan(D,(N,L,R),TL) \in CL$.

Extending the standard clause-lgg with semantic lgg operators can reduce the search
space significantly, resulting in faster learning and extraction times. Especially if spans
in a document are huge, the insertion of the member predicates is of practical
relevance. Instead of considering all possible instances for the left and right delimiter of
the span, they are constrained to take only values between the smallest and the greatest
value seen so far. All results presented in this paper have been generated by using only
one semantic lgg operator, namely the one for the xspan literal.
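The delimiter constraint added by the semantic lgg operator for xspan can be sketched as two integer ranges. The function name, the dictionary result and the reading of the member lists as closed ranges between the smaller and the larger delimiter are assumptions for illustration.

```python
def semantic_span_lgg(span1, span2):
    """Admissible values for the generalized delimiters L and R of two spans
    (node, left, right): every value between the smallest and the greatest
    delimiter seen in the two spans."""
    (_, l1, r1), (_, l2, r2) = span1, span2
    return {
        "L": list(range(min(l1, l2), max(l1, l2) + 1)),
        "R": list(range(min(r1, r2), max(r1, r2) + 1)),
    }


# The two spans from the example in the text.
constraints = semantic_span_lgg(([1, 2, 3], 3, 6), ([1, 2, 3], 1, 10))
print(constraints)
```

Instead of leaving both delimiters as unconstrained variables, a rule then only needs to try the listed values, which is where the reduction of the search space comes from.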

6 Results and Conclusion 

We tested the BFOIL algorithm with our extraction system LIPX on the RISE repository
[13]. RISE contains document resources with extraction task descriptions taken from
various IE research papers and projects. Most publications refer to these problem cases
as a kind of standard test. Unfortunately not all approaches give a complete overview
of their results with respect to precision and recall values. We focused on extraction
tasks from HTML documents only and learned multi-slot extraction rules for HTML
resources as described in the RISE repository.

All tests were run using a fixed number of randomly drawn examples to perform 20
learning and test runs for each problem class. The settings are shown in the first table of
Figure 3, which lists for each problem class the total number of tuples (t), the number of
examples and the average number of learned rules (r). For each problem class the
learning examples were randomly drawn from one half of the available documents. The
testing set consisted of all documents, but only the data tuples not used for learning were
considered. An extraction was counted as correct when all of its slots were correctly
extracted. Values for precision, recall and F1 are displayed in percentages, all other
values in totals. For all tests we used the hypothesis language described in Section 4.1
with context distance n = 1. Figure 4 shows the best F1 (harmonic mean of precision and
recall) values.
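The evaluation measures can be sketched as follows, assuming extractions and expected tuples are given as sets of slot tuples so that a tuple only counts as correct when every slot matches. The function name and the sample data are hypothetical and do not reproduce any result from the experiments.

```python
def precision_recall_f1(extracted, expected):
    """Precision, recall and F1 over whole tuples: an extraction is correct
    only if all of its slots match an expected tuple exactly."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical two-slot tuples; ("c", "9") has one wrong slot and is
# therefore counted as an incorrect extraction.
extracted = {("a", "1"), ("b", "2"), ("c", "9")}
expected = {("a", "1"), ("b", "2"), ("c", "3"), ("d", "4")}
print(precision_recall_f1(extracted, expected))
```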

Fig. 4. Best F1 multi-slot results

Comparing LIPX results with other multi-slot IE systems is not straightforward, because
almost all systems set up different evaluation scenarios with respect to the number of
examples, their selection criteria and the number of test iterations. The first table of
Figure 5 shows the results (median) for single and multi-slot learning in comparison**
to the systems SoftMealy [7], Stalker [14] and Wien [9]. Even though LIPX is developed
for multi-slot tasks, we tested it on single-slot extraction tasks to provide a comparison
to one state-of-the-art single-slot extraction approach (BWI) of [6]. These results are
listed in the second part of Figure 5. Although learning single-slot wrappers makes the
relational span level predicates superfluous, the single-slot learning results also
underline the high precision values observed with multi-slot learning. In 5 out of 7 cases
LIPX shows better or equal precision values than BWI and BWI HMM. This is not too
surprising, because in the worst case BFOIL only memorizes the examples. This did
not happen with these test cases, but in two cases the F1 rates are not acceptable due to
the low recall rate. There are mainly two interacting reasons for this behavior, which
form a general observation for multi- and single-slot learning. First, BFOIL seems to
yield bad recall rates if only a few examples are present and those differ strongly
regarding their relational description (Quote, IAF-altname). Secondly, some tests were
run with only a few training examples, because of the bad runtime behavior caused by
BFOIL's naive answer set computation for each generalized rule. Consequently the recall
rate was low (LA, LA-cc, CS-name).

** All values for SoftMealy, Stalker and Wien are taken from [7].

The presented approach offers a wide variety of extensions by modifying the token
representation of text units for richer semantic text pre-processing. This makes it
possible to incorporate linguistic or additional general semantic information. By
modifying the underlying hypothesis language we can adapt the presented approach to
other markup languages or focus on different relationships than those stated in this
paper. Using natural language tools (e.g. a part-of-speech tagger) for the pre-processing
of documents in combination with an XML representation of such pre-processed
documents also allows us to apply our methods to natural language texts.

By extending the BFOIL algorithm with additional semantic lgg operators the hypothesis
search space can be constrained and the runtime behavior improved. An additional
modification to increase the recall rate is to accept rules that cover a small number of
false positives. This modification has not been tested yet, but it is easily accomplished
by incorporating a threshold (e.g. accepting a rule if the percentage of false positive
extractions is below 0.03 % (Algorithm 5.1, line 9)). These observations show that all
results presented in this paper depend strongly on the chosen hypothesis language and
the degree of additional information chosen for the representation of TDOM nodes. So
far we have only experimented with the one mentioned in Section 4.1 without any fine
tuning (e.g. context distance, sem-lgg). LIPX partially shows bad learning times, which
clearly stem from the combinatorial explosion while applying a rule that became too
general during the learning process. In fact, evaluating each new rule by computing the
answer set for it leads to this problem. Thus we are doing research on using more
efficient proof procedures than SLD-resolution, clustering of example description rules
and extending BFOIL with specialization operators to minimize this problem. To
summarize the capabilities: LIPX can learn single- and multi-slot wrappers for HTML or
XML documents. It can handle slot fillers occurring in varying orders in the texts and it
can handle slots that may be empty, missing or nested. Though the presented approach
shows very promising results, its runtime behavior is a major subject for improvement.
Nevertheless the purely logic-programming-based technique for learning multi-slot
wrappers, the general method of lgg operations for learning and its independence of the
application domain are auspicious properties.
Abstract. This paper describes a method designed for data mining
applications where the main goal is to predict extreme and rare values of
a continuous target variable, as well as to understand under which con- 
ditions these values occur. Our objective is to induce models that are 
accurate at predicting these outliers but are also interpretable from the 
user perspective. We describe a new splitting criterion for regression trees 
that enables the induction of trees achieving these goals. We evaluate our 
proposal on several real world problems and contrast the obtained mod- 
els with standard regression trees. The results of this evaluation show 
the clear advantage of our proposal in terms of the evaluation statistics 
that are relevant for these applications. 

1 Introduction 

The work described in this paper addresses applications where the main ob- 
jective is to model rare extreme values, usually known as outliers. Given that 
the target variable is continuous we are facing regression problems. However, 
the main difference to standard regression tasks is that our main interest is to 
predict accurately the occurrences of rare high or low values of the target vari- 
able. A typical real world application is the prediction of stock market returns, 
where small and highly frequent returns are irrelevant for investors, while large 
movements of the market are the key events where accurate prediction pays off. 
Our interest is not only to anticipate the occurrence of an extreme value but 
also to be accurate at predicting its concrete value, because the amplitude of 
the outlier is relevant for the user of these applications, as it may lead to dif- 
ferentiated actions. Another major requirement of our target applications is the 
interpretability of the models. This means that discovering the conditions that 
lead to these extreme values is also a major goal of our models. 

Applications where the main modeling objective is rare events abound in
recent data mining literature. Nevertheless, existing related work is mostly
focused on discrete target variables (i.e. classification tasks). These works include
topics like activity monitoring [4], prediction of rare events [17,18], anticipation
of surprising patterns [7], novelty detection and anomaly detection, among others.
Most of this research is also linked to applications where a data stream is being
monitored with the goal of anticipating rare events, that is, time-dependent data.
This research is usually focused on the task of distinguishing between interesting
cases and "normal" occurrences.

The importance and impact of rare cases has been the topic of research 
on small disjuncts (e.g. [6,19]). This research is again mainly focused on clas- 
sification tasks and is also strongly related to the study of applications with 
unbalanced class distributions (e.g. [5]). 

A frequent strategy to bias the models towards being accurate in particular 
types of cases is the use of differentiated misclassification costs (e.g. [16]). This is 
a common practice in classification tasks and was also used in solving regression 
problems through a classification approach [15]. 

None of these classification approaches solves the problem of accurately
predicting the specific value of outliers, and they are particularly inadequate
when these values spread over a wide range. If the amplitude of the extreme
values is relevant for the user, for instance for taking different actions,
approaches based on classification are not applicable. Obviously, one could
further divide the classes representing the extreme values into more specific
classes to differentiate their importance, but that would partition an already
sparsely populated class into several classes, making the modeling task even
more difficult. As such, for this kind of application only a regression model
can handle the problem properly.

Buja and Lee [2] have recently presented a series of new splitting criteria for 
both classification and regression trees that address related problems. Regarding 
regression, they propose two different splitting criteria with two objectives: iden- 
tifying extreme buckets of the data; and identifying pure (low variance) buckets. 
The first objective is particularly related to ours. The goal of Buja and Lee is
to identify areas of the regression surface where the target variable shows a high
or low mean value. Although our goal is related to this, we are particularly
interested in applications where these extreme values are rare, which demands
specific criteria.

We propose a new splitting criterion for regression trees which enables the 
induction of models that meet our application requirements. In Section 2 we 
formalize our target problems and propose evaluation criteria that should guide 
the search for the best models. Section 3 describes the details of our proposal. 
The experimental evaluation of this proposal is presented in Section 4. We finish 
with the conclusions of this work and future research directions. 



2 Problem Formulation 

In this section we present a general description of our problem. Let $D$ be a
data set consisting of $n$ cases $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where
$\mathbf{x}_i$ is a vector of $p$ discrete or continuous variables, and $y_i$ is
a continuous target variable value. As we have mentioned before, we are
interested in models that are able to predict accurately rare extreme values of
$Y$. To achieve this goal we need to formalize the notion of rare extreme
values. We use the statistical notion of outlier for this purpose. Box plots are
visualization tools that are often used to identify outliers. Extreme values are
defined in these plots as values above or below the so-called adjacent values
[3]. Let $r$ be the interquartile range, defined as the difference between the
3rd and 1st quartiles of the target variable. The upper adjacent value, $adj_H$,
is defined as the largest observation that is less than or equal to the 3rd
quartile plus $1.5r$. Analogously, the lower adjacent value, $adj_L$, is defined
as the smallest observation that is greater than or equal to the 1st quartile
minus $1.5r$. Given these two limits we can define our rare extreme values as,

\[
\begin{aligned}
O   &= \{ y \in D \mid y > adj_H \vee y < adj_L \} \\
O_H &= \{ y \in D \mid y > adj_H \} \\
O_L &= \{ y \in D \mid y < adj_L \}
\end{aligned}
\tag{1}
\]

Depending on the application we may have either $O_L$ or $O_H$ empty¹. Figure 1
shows the box plots of the targets in two applications where we have different
types of outliers. These values are drawn with circles in these graphs.
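
The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the quartiles are computed with the "inclusive" method of the standard library, which is one of several common quartile conventions and may differ slightly from the one used to draw the box plots in the paper.

```python
import statistics


def adjacent_values(y):
    """Return (adj_L, adj_H), the box-plot adjacent values of the sample y."""
    # Assumption: "inclusive" quartiles; other conventions shift the fences slightly.
    q1, _, q3 = statistics.quantiles(y, n=4, method="inclusive")
    r = q3 - q1  # interquartile range
    adj_h = max(v for v in y if v <= q3 + 1.5 * r)
    adj_l = min(v for v in y if v >= q1 - 1.5 * r)
    return adj_l, adj_h


def outlier_sets(y):
    """Split y into the low and high rare extreme values (O_L, O_H)."""
    adj_l, adj_h = adjacent_values(y)
    o_low = [v for v in y if v < adj_l]
    o_high = [v for v in y if v > adj_h]
    return o_low, o_high


sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(adjacent_values(sample))
print(outlier_sets(sample))
```

In this toy sample only the value 50 lies above the upper adjacent value, so the set of low outliers is empty, which mirrors the situation where one of the two sets is empty.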





Fig. 1. Two example box plots with different types of extreme values: a) The relative
performance of a set of CPUs; b) The 3-day returns of IBM closing prices.



Having described the main features of our target applications we need to
define some evaluation criteria to guide the search for the best models. Typical
performance measures used in regression settings, such as the mean squared
error, are inadequate as they do not stress the fact that we are only interested
in the performance in extreme values. This is the same kind of phenomenon as
the one reported regarding the use of classification accuracy on problems with
unbalanced class distributions [8,10].

¹ We will discard applications where both sets are empty as these are not relevant for
this study.

In the information retrieval literature (e.g. [9]) the notion of relevance seems
particularly adequate to our needs. Relevance is defined as the value or utility
of a system output as a result of a user search. Relevance is most often
assessed using two measures: precision and recall. Precision is defined as the
proportion of the cases predicted as target events that really are target events.
Recall is defined as the proportion of existing target events that are captured by
the model. Our proposal consists of adapting these two measures to our problem
setup with the goal of developing a learning tool that maximizes the relevance
of the induced model to our application goals.

We define recall in the context of our target applications as the proportion
of outliers in our data that are predicted as such (i.e. covered) by our model,

\[
recall = \frac{\left| \{ \hat{y} \in \hat{Y}_O \mid (y \in O_H \wedge \hat{y} > adj_H) \vee (y \in O_L \wedge \hat{y} < adj_L) \} \right|}{|O|}
\tag{2}
\]

where $\hat{Y}_O$ is the set of predictions $\hat{y}$ of the model for the outlier cases (i.e. $O$).
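With respect to precision, if we use its standard definition we have,

\[
precision_{stand} = \frac{\left| \{ \hat{y} \in \hat{Y}_O \mid (y \in O_H \wedge \hat{y} > adj_H) \vee (y \in O_L \wedge \hat{y} < adj_L) \} \right|}{\left| \{ \hat{y} \in \hat{Y} \mid \hat{y} < adj_L \vee \hat{y} > adj_H \} \right|}
\tag{3}
\]

where $\hat{Y}$ is the set of predictions $\hat{y}$ of the model.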

However, this definition is not adequate for our goals. For instance, with this
formulation, assuming $adj_H = 5.6$, a predicted value of 5.8 would have the same
value as a prediction of 10.1, for a test case where the true value is 10.5. In our
applications this is not acceptable. Otherwise, the best solution would probably
be to discretize the target variable and handle the problem as a classification
task with differentiated misclassification costs. As we want to distinguish these
kinds of errors we need to use another definition of precision ($precision_{regr}$)
that takes into account the distance between the predicted and true values. At the
same time we want to maintain the scale of the measure within the 0..1 interval
so that we are able to integrate recall and precision into a single measure using
standard approaches. Our proposed definition of $precision_{regr}$ is the following,



\[
precision_{regr} = 1 - NMSE_O
\tag{4}
\]

where $NMSE_O$ is the normalized squared error of the model for the outliers,

\[
NMSE_O = \frac{\sum_{y_i \in O} (\hat{y}_i - y_i)^2}{\sum_{y_i \in O} (\bar{Y} - y_i)^2}
\tag{5}
\]

and $\bar{Y}$ is the average target value in the training data.

The value of $NMSE_O$ will usually be between 0 and 1. For the cases where
this value goes above 1, which means that the model is performing worse than
the naive average model, we consider that the precision of the model is 0.
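
As a rough illustration of Equations (4) and (5), the sketch below computes the regression precision for a list of outlier cases and truncates it at zero. The function name and the training mean of 2.0 used in the example are hypothetical; only the true value 10.5 and the two predictions 5.8 and 10.1 come from the text above.

```python
def precision_regr(y_true, y_pred, y_train_mean):
    """1 - NMSE over the outlier cases, truncated at 0 (Equations 4 and 5).

    y_true and y_pred hold the true and predicted values of the outlier
    cases only; y_train_mean is the average target in the training data.
    """
    sq_err_model = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of the naive model that always predicts the training mean.
    sq_err_naive = sum((y_train_mean - t) ** 2 for t in y_true)
    nmse = sq_err_model / sq_err_naive
    return max(0.0, 1.0 - nmse)


# Hypothetical training mean of 2.0; the true value of the test case is 10.5.
print(precision_regr([10.5], [5.8], 2.0))
print(precision_regr([10.5], [10.1], 2.0))
```

Unlike the standard definition, this measure rewards the prediction of 10.1 more than the prediction of 5.8, although both fall above the upper adjacent value.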

Obtaining an overall evaluation measure from the values of recall and precision
provides a global preference criterion that can be used to guide the search
for the models. The F-measure [11] is among the most used measures and is
defined as,

\[
F = \frac{(\beta^2 + 1) \cdot precision \cdot recall}{\beta^2 \cdot precision + recall}
\tag{6}
\]

where $\beta$ controls the relative importance of recall to precision. This is the
definition we use, replacing precision by our proposed $precision_{regr}$.
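
The recall of Equation (2) and the F-measure of Equation (6) can be sketched as follows. The function names, the toy values and the adjacent values used in the example are illustrative assumptions, not part of the original experiments.

```python
def recall(y_true, y_pred, adj_l, adj_h):
    """Fraction of true outliers whose prediction falls beyond the same
    adjacent value (Equation 2)."""
    outliers = [(t, p) for t, p in zip(y_true, y_pred) if t > adj_h or t < adj_l]
    if not outliers:
        return 0.0
    covered = sum(
        1 for t, p in outliers
        if (t > adj_h and p > adj_h) or (t < adj_l and p < adj_l)
    )
    return covered / len(outliers)


def f_measure(precision, recall_value, beta=1.0):
    """F-measure of Equation 6; beta weights recall against precision."""
    denom = beta ** 2 * precision + recall_value
    if denom == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall_value / denom


# Toy data: two high outliers (9.0 and 12.0), only one predicted above adj_H.
r = recall([1.0, 2.0, 9.0, 12.0], [1.5, 2.5, 4.0, 8.0], adj_l=0.0, adj_h=5.6)
print(r, f_measure(0.8, r))
```

With beta equal to 1 the measure reduces to the harmonic mean of precision and recall.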



3 An Approach Using Regression Trees 



Regression trees are known for their computational efficiency, model interpreta- 
bility and competitive accuracy. For these reasons we have decided to use these 
models as the base paradigm behind our proposal. 

Standard regression trees are obtained using a procedure that minimizes the
squared error. This means that the best splits for each tree node are chosen to
minimize the weighted squared error between the two branches. As mentioned
by Buja and Lee [2] this criterion is not adequate for several data mining
applications. That is also the case of our target problems. Moreover, outliers
can be a problem for standard regression trees as they may distort the selection
of the best splits and may also have a large impact on the average values chosen
for the leaves of the trees [14].

The main idea of our proposal to avoid the problems reported above is to use
the F-measure presented in Equation (6) to guide the split selection procedure
used to grow the trees. As such, the key distinguishing feature of our method is
the criterion used to select the best test for each tree node. In our proposal the
best split, $s^*$, is chosen using the following criterion,

\[
s^*(t) = \max_{s \in S} \max\left( F(D_{t_L}), F(D_{t_R}) \right)
\tag{7}
\]

where $S$ is the set of trial splits for the node $t$²; $D_{t_L}$ is the subset of
cases in $t$ ($D_t$) that satisfy the test $s$ (i.e. the left sub-branch of $t$),
while $D_{t_R}$ contains the remaining cases (i.e. $D_{t_R} = D_t - D_{t_L}$); and
$F(D)$ is the F-measure for a set of cases.

In order to obtain the F-measure for the branches of a candidate split we
need to obtain the values of precision and recall, which we do using the following
formulas,

\[
precision_{regr_t} =
\begin{cases}
1 - \dfrac{\sum_{y_i \in O_H(D_t)} (\bar{y}_t - y_i)^2}{\sum_{y_i \in O_H(D_t)} (\bar{Y} - y_i)^2} & \text{if } \bar{y}_t > adj_H \vee \bar{y}_t > \tilde{y}_t \\[2ex]
1 - \dfrac{\sum_{y_i \in O_L(D_t)} (\bar{y}_t - y_i)^2}{\sum_{y_i \in O_L(D_t)} (\bar{Y} - y_i)^2} & \text{if } \bar{y}_t < adj_L \vee \bar{y}_t < \tilde{y}_t
\end{cases}
\tag{8}
\]

where $O_H(D_t)$ ($O_L(D_t)$) is the set of cases of node $t$ that belong to $O_H$
($O_L$); $\bar{y}_t$ is the average $Y$ value in the node; $\tilde{y}_t$ is the median
$Y$ of the node; and $\bar{Y}$ is the average $Y$ in the training data.

² The trial splits are the same as in a standard regression tree.

This means that depending on the value of the node average we consider this
branch as an attempt to predict high or low outliers, and calculate its precision
accordingly. Even if the node average is not in the outlier range of values we still
calculate the precision in the node, using the median as a threshold for deciding
whether to calculate it with respect to high or low outliers.
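
A minimal sketch of this node-level precision is given below. It reflects one reading of Equation (8): the node predicts its own average, and the comparison of the average with the adjacent values and with the node median decides which type of outlier is evaluated. The function name, the truncation at zero carried over from Section 2, and the value returned when the node holds no outliers of the selected type are assumptions.

```python
import statistics


def node_precision(node_y, y_train_mean, adj_l, adj_h):
    """Regression precision of a node that predicts its own average."""
    mean_t = statistics.fmean(node_y)
    median_t = statistics.median(node_y)
    if mean_t > adj_h or mean_t > median_t:
        targets = [v for v in node_y if v > adj_h]  # high outliers in the node
    elif mean_t < adj_l or mean_t < median_t:
        targets = [v for v in node_y if v < adj_l]  # low outliers in the node
    else:
        targets = []  # average equals the median and is not an outlier
    naive = sum((y_train_mean - v) ** 2 for v in targets)
    if naive == 0:
        return 0.0  # assumption: no outliers of the selected type in the node
    nmse = sum((mean_t - v) ** 2 for v in targets) / naive
    return max(0.0, 1.0 - nmse)


# Toy node whose average (11.75) lies above its median (11.5).
print(node_precision([2, 3, 20, 22], y_train_mean=4.0, adj_l=-5.0, adj_h=10.0))
```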

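Regarding recall we use,

\[
recall_t =
\begin{cases}
0 & \text{if } \bar{y}_t \geq adj_L \wedge \bar{y}_t \leq adj_H \\[1ex]
\dfrac{\left| \{ y \in D_t \wedge y \in O_H \} \right|}{|O_H|} & \text{if } \bar{y}_t > adj_H \\[2ex]
\dfrac{\left| \{ y \in D_t \wedge y \in O_L \} \right|}{|O_L|} & \text{if } \bar{y}_t < adj_L
\end{cases}
\tag{9}
\]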



When a trial split leads to a branch having an average target value that is 
not an outlier, the respective recall is zero. This would lead to an F value of 
zero according to Equation (6). This is a common situation particularly in top 
level nodes, where the partitions are still too big, and thus the average Y is 
seldom an outlier. Moreover, sometimes all trial splits for a node are in these 
circumstances. This means that we are not able to select the best split for these 
nodes as all splits have the same score, and thus the tree growth procedure 
would stop prematurely. These situations occur because in complex applications 
we seldom find a single split that is able to isolate extreme values in one of 
the branches so that the branch has an average target that is an outlier. This 
problem decreases as the tree grows because the number of cases in the nodes 
gets smaller and thus finding such splits is easier. Although these top level splits 
have zero recall we should still be able to establish a preference criterion to select 
one, because we can calculate their precision. In order to overcome this difficulty
we have added a small threshold³ to the value of recall in Equation (6) so that
the value of F is not zero even when the recall is null.
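
The node recall of Equation (9) and the effect of the small threshold can be sketched as follows. The function names are hypothetical, and adding the threshold to recall wherever it appears in Equation (6) is our reading of the description above; the value 0.001 is the one reported for the experiments.

```python
import statistics

RECALL_FLOOR = 0.001  # small threshold added to recall


def node_recall(node_y, all_y, adj_l, adj_h):
    """Recall of a node (Equation 9); node_y must be a subset of all_y."""
    mean_t = statistics.fmean(node_y)
    if mean_t > adj_h:
        total = sum(1 for v in all_y if v > adj_h)
        return sum(1 for v in node_y if v > adj_h) / total
    if mean_t < adj_l:
        total = sum(1 for v in all_y if v < adj_l)
        return sum(1 for v in node_y if v < adj_l) / total
    return 0.0  # the node average is not an outlier


def f_with_floor(precision, recall_value, beta=1.0, floor=RECALL_FLOOR):
    """F-measure with the threshold added to recall, so that splits with
    zero recall can still be ranked by their precision."""
    r = recall_value + floor
    return (beta ** 2 + 1) * precision * r / (beta ** 2 * precision + r)


all_y = [1, 2, 3, 4, 20, 30, 40]
print(node_recall([30, 40], all_y, adj_l=-5, adj_h=10))
print(node_recall([1, 2, 3, 4, 20], all_y, adj_l=-5, adj_h=10))
print(f_with_floor(0.5, 0.0), f_with_floor(0.2, 0.0))
```

Both candidate branches in the last line have zero recall, yet the one with the higher precision obtains the higher score, which is what allows tree growth to continue at top level nodes.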

³ We have used the value of 0.001 in our experiments.

Summarizing, our proposal consists of selecting the splits that are able to
generate a branch (a subset of cases) with a high value of the F-measure. Notice
that we do not search for a weighted solution between the two branches.
Even if one of the branches has a poor F score, as long as the other achieves a
high F-measure we have a good candidate split. This strategy is similar to the
one followed by Buja and Lee [2], who also do not search for splits with a
good compromise between the left and right branches. These strategies lead to
unbalanced trees. Still, we share the opinion of Buja and Lee, who consider these
trees more interpretable.
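
The selection rule summarized above can be sketched generically: each trial split is scored by the better of its two branches, with no weighting between them. In this illustration the function name is hypothetical, the branch score is passed in as a function, and the toy score used in the example (the fraction of values above 10) is only a stand-in for the F-measure.

```python
def best_split(cases, splits, score):
    """Return (index, value) of the trial split whose better branch scores highest.

    cases  : list of (x, y) pairs in the node
    splits : list of predicates on x; True sends the case to the left branch
    score  : function mapping a list of y values to a number
    """
    best_index, best_value = None, float("-inf")
    for i, test in enumerate(splits):
        left = [y for x, y in cases if test(x)]
        right = [y for x, y in cases if not test(x)]
        if not left or not right:
            continue  # a split must send cases to both branches
        value = max(score(left), score(right))  # branches are not weighted
        if value > best_value:
            best_index, best_value = i, value
    return best_index, best_value


# Toy stand-in for the F-measure: fraction of branch values above 10.
toy_score = lambda ys: sum(v > 10 for v in ys) / len(ys)
cases = [(1, 2), (2, 3), (3, 20), (4, 22), (5, 1)]
splits = [lambda x: x <= 1, lambda x: x <= 2, lambda x: x <= 4]
print(best_split(cases, splits, toy_score))
```

Because only the better branch counts, a split that isolates most extreme values on one side is preferred even if the other side is uninformative, which is why the resulting trees tend to be unbalanced.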

Another important question that needs to be addressed when developing
a tree-based system is the tree growth stopping criterion. This is a statistical
estimation problem and most systems use a two-stage procedure consisting of
growing an overly large tree (possibly overfitting the training data), and then
using some statistical estimation procedure (e.g. cross validation) for
post-pruning this tree⁴. Given that outliers are insignificant from a statistical
perspective, these strategies are difficult to implement in our system because
they are based on statistical significance. Because of this we have decided not
to post-prune our trees. This is consistent with what is mentioned by Weiss and
Hirsh [19] in the context of learning from small disjuncts. These authors mention
that pruning is considered questionable when the learning objectives are small
subsets of cases.

Our method obtains a tree model in a single stage, stopping the tree growth
when one of the following conditions arises:

- The F-measure of the node is above a certain user-definable threshold,

- Or the node does not contain any extreme value (i.e. $D_t \cap O = \emptyset$).
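
The two stopping conditions can be written as a single check. The function name is hypothetical, and the default threshold of 0.7 is the value used in the experiments of Section 4 rather than a fixed part of the method.

```python
def should_stop(node_y, node_f_value, adj_l, adj_h, f_threshold=0.7):
    """True when tree growth stops at this node."""
    has_outlier = any(v > adj_h or v < adj_l for v in node_y)
    return node_f_value > f_threshold or not has_outlier


print(should_stop([1, 2, 3], 0.1, adj_l=0, adj_h=10))   # no extreme values left
print(should_stop([1, 20], 0.1, adj_l=0, adj_h=10))     # keep splitting
print(should_stop([1, 20], 0.9, adj_l=0, adj_h=10))     # F above the threshold
```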

In order to illustrate the effects of using the proposed splitting criterion as
opposed to standard least squares methods, we describe a small example
application. Due to space reasons we have chosen a dataset that leads to small
trees. We have used the well-known CPU performance dataset. In this domain
the task is to predict the relative performance of a set of CPUs given some
hardware characteristics of these machines. The dataset has 23 high outlier values
(values above 237, cf. Figure 1a). Using a CART-like regression tree⁵ with a
standard 1-SE cross validation pruning algorithm [1], we get the tree on the
right-hand side of Figure 2. From the point of view of outliers this tree isolates
two classes of outliers, both formed by machines with a maximum main memory
size above 28000Kb: one class is less extreme in terms of performance (average
performance of 299) and includes machines with cache size below 80Kb; and
the other class contains machines with larger cache size that have higher
performance (average of 667). According to this tree, all computers with less than
28000Kb memory have low performance. Still, there are three exceptions to this
(the numbers between parentheses on each node are the number of outliers in
that node) that are neglected by this tree. The solution of our model is given
on the left-hand side of Figure 2. Our tree is much more specific in terms of
describing the conditions leading to outliers. Moreover, it further distinguishes
the type of outliers. Namely, we can identify even more extreme performance
machines that have a high main memory size (above 48000Kb). Our tree also
describes the machines with an outlier performance that have less than 28000Kb
memory. This tree is clearly more consistent with the distribution of the outliers
(i.e. the type of machines with high performance), as can be seen from the box
plot of the target variable presented in Figure 1a. Although one may think that
this tree could be simply overfitting the data, the fact is that, as we will see in
the results of our experiments for this domain, our models achieve a significantly
higher precision, recall and F-value.

⁴ See [13] for an overview of pruning methods for regression trees.

⁵ In this paper we have used as base implementation of regression trees the package
rpart [12] of the open source statistical software R (www.r-project.org). This package
is a close re-implementation of most of CART’s [1] features.

In summary, from the perspective of understanding the type of extreme values
occurring in this domain, and also under which conditions these appear, we claim
that our tree is more informative than a standard regression tree. Moreover, this
higher interpretability is accompanied by better accuracy, as will be shown in
Section 4.





Fig. 2. Our regression tree vs the tree obtained by a CART-like system, on the
machine CPU dataset.



4 Experimental Results 

In this section we perform an experimental analysis of the trees obtained with 
our method. Our analysis compares our proposal to its base paradigm, standard 
regression trees. 

We have carried out a series of experiments using the datasets described in 
Table 1. These datasets include applications obtained from standard repositories 
as well as some commercial applications. 




















Table 1. Datasets description.

Dataset            cases   continuous   nominal   # outliers   low        high
                           attr.        attr.                  outliers   outliers
servo                167        0          4           30          0         30
triazines            186       60          0            9          9          0
algae1               200        8          3           12          0         12
algae2               200        8          3           10          0         10
algae3               200        8          3           22          0         22
algae4               200        8          3           16          0         16
algae5               200        8          3           13          0         13
algae6               200        8          3           19          0         19
algae7               200        8          3           21          0         21
machine_cpu          209        6          0           23          0         23
china                217        9          0           19          0         19
Boston               506       13          0           37          0         37
onekm                710       14          3            8          6          2
cw.drag             1449       12          2           52          1         51
co2.emission        1558       19          8           23          0         23
acceleration        1732       11          3           26          0         26
available.power     1802        7          8          121          0        121
bank8FM             4499        8          0           69          0         69
delta.ailerons      7129        5          0          107         41         66
ibm                 8166       10          0          325        140        185
cpu.small           8192       12          0          430        430          0
delta.elevators     9517        6          0          132         60         72
cal.housing        20460        8          0         1071          0       1071
add                30000       10          0           63          0         63
fried.delve        40768       10          0           25          6         19


We have carried out 5 repetitions of 10-fold cross validation experiments using
these datasets. These experiments were designed with the goal of estimating
the average difference in precision, recall, and F-measure, between a standard
regression tree and our proposed method. For the standard method we have
used the package rpart of R, using cross validation error-complexity pruning
with the 1-SE rule according to the method in [1]. Regarding our method we
have used an F-value of 0.7 as the threshold for deciding when to stop tree growth.
The statistical significance of the observed differences was assessed through paired
t-tests. Differences that are significant at the 95% level are marked with one
sign, while differences significant at 99% have two signs. Plus (+) signs are used
to mark differences favorable to standard regression trees, while minus (-) signs
are used to indicate the significant wins of our method. Differences that are not
significant at these confidence levels have no sign. The F-measure of each method
was calculated with $\beta = 1$, meaning that the same weight was given to precision
and recall (cf. Equation (6)).
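
The marking scheme described above can be sketched as follows. This is an assumption-laden illustration rather than the procedure actually run in the experiments: the function names are hypothetical, the test is taken to be two-sided, and the critical values of the t distribution are supplied by the caller (the ones in the example correspond to 4 degrees of freedom, matching the five toy scores).

```python
import math
import statistics


def paired_t(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    d = [x - y for x, y in zip(a, b)]
    return statistics.fmean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def significance_mark(standard, ours, t95, t99):
    """'+' marks favour the standard tree, '-' marks favour our method;
    two signs for the 99% level, one for 95%, none otherwise."""
    t = paired_t(standard, ours)
    sign = "+" if t > 0 else "-"
    if abs(t) >= t99:
        return sign * 2
    if abs(t) >= t95:
        return sign
    return ""


# Toy F-measure scores over five folds (illustrative numbers only).
standard = [0.50, 0.52, 0.48, 0.51, 0.49]
ours = [0.70, 0.71, 0.69, 0.72, 0.68]
print(significance_mark(standard, ours, t95=2.776, t99=4.604))
```

Note that the statistic is undefined when all paired differences are identical, since their standard deviation is then zero.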

The results of our experiments are shown in Table 2. This table shows an
overwhelming advantage of our method, at least from the perspective of the
F-measure, which was the criterion used to grow our trees. In effect, in the 25
datasets there were 18 significant wins, 3 insignificant wins and 4 insignificant
losses of our proposal. The advantage is even more remarkable in terms of the
proportion of outliers in the domain that are captured by the model (i.e. the
recall). However, the results in terms of precision are less favorable. We have
tried to understand the reasons for this lack of precision in some domains. We
have varied the F threshold that guides the criterion for stopping tree growth
and have observed some variations in these results that seem to indicate that
there is some room for improvement of our method by tuning this parameter.
Apparently, tree growth may be stopping too soon for the large datasets, where
our performance seems to be degrading. Still, if precision was the key objective
we could also tune the $\beta$ parameter of the F-measure that weights the preference
between recall and precision when selecting the best splits of the trees⁶.

Some of the results given in Table 2 deserve further explanation. Given the
definition of precision we use (Equations (4) and (5)) it may seem strange to see
some zero values in precision. These occur because some models have an NMSE
at predicting the outliers equal to or above one. Namely, for several datasets the
CART tree is simply a single leaf node, which leads to an NMSE of one, and thus
a precision of zero. The values of zero recall are a consequence of models that do
not predict any of the outliers as such, which occurs when a tree does not have
any leaf with an average value that is an outlier.

Summarizing, the results of these experiments clearly show the advantage of
our proposal in terms of predicting outliers. Nevertheless, we think some room
is left for improvement, particularly in terms of tuning the system by changing
the stopping criterion as well as the weight between precision and recall. For
large datasets, the best solution would probably be to keep a holdout set for
proper tuning of these parameters.



5 Conclusions 

We have described a new splitting criterion for regression trees with the goal of
addressing a specific class of data mining applications. In these domains the main
goal of modeling is to predict accurately outlier values in the target variable
and also to understand under which conditions these values occur. Our proposal
addresses these application goals by leading to regression trees designed to
maximize both the number of outliers that are captured by the model and the
precision at predicting their values.

The resulting trees were shown to achieve our goals in an extensive experi- 
mental comparison using 25 domains. In these experiments we have compared 
our approach to a standard regression tree and concluded that our proposal 
clearly outperforms these trees regarding the evaluation criteria that are ade- 
quate for this type of applications. 

Regarding future work we plan to investigate more deeply the reasons for the
failure of our models in terms of precision in some of the domains. Our current
explanation lies in the tree growth stopping criterion and we intend to explore
other alternatives to the current user-settable threshold on the F-measure value.



⁶ In our experiments we used equal weight.

Table 2. Regression trees vs our method in terms of Precision, Recall and F measure.

                   Precision_regr                    Recall                            F measure
Dataset            CART       Our method  Signif    CART       Our method  Signif    CART       Our method  Signif
servo              0.7616598  0.7829856             0.8713333  0.9800000   --        0.8053263  0.8668980   -
triazines          0.1474294  0.3113946   -         0.0400000  0.2233333   -         0.0371048  0.2286451   -
algae1             0.2947144  0.4377268   --        0.0000000  0.3266667   --        0.0000000  0.3303920   --
algae2             0.0000000  0.1394068   --        0.0000000  0.0700000   -         0.0000000  0.0483434
algae3             0.0000000  0.0994022   -         0.0000000  0.1820000   --        0.0000000  0.0766102   -
algae4             0.0000000  0.1034622   --        0.0000000  0.1416667   --        0.0000000  0.0875694   -
algae5             0.0000000  0.1286299   --        0.0000000  0.0673333   -         0.0000000  0.0569739   -
algae6             0.0000000  0.0481409   -         0.0000000  0.1600000   --        0.0000000  0.0383576   -
algae7             0.0127059  0.0983871   -         0.0100000  0.1336667   -         0.0111917  0.0964774   -
machine_cpu        0.5517675  0.6879704   --        0.8186667  0.8950000   -         0.6266528  0.7596690   --
china              0.0000000  0.0740538   -         0.0000000  0.0706667   -         0.0000000  0.0621120   -
Boston             0.8243590  0.8225584             0.7580000  0.7595000             0.7675711  0.7361953
onekm              0.4001506  0.3350779             0.0166667  0.2600000   --        0.0059831  0.2701772   --
cw.drag            0.9250750  0.8269861   ++        0.8419906  0.9656190   --        0.8734481  0.8864179
co2.emission       0.8052664  0.8100812             0.4213333  0.5826667   -         0.4770925  0.6368632   -
acceleration       0.8964751  0.9010154             0.5600000  0.8210000   --        0.6287861  0.8265599   --
available.power    0.9668409  0.8567897   ++        0.9091224  1.0000000   --        0.9353684  0.9217586
bank8FM            0.9781688  0.9529205   +         0.6532389  0.5876751             0.7592038  0.6971588
delta.ailerons     0.6442916  0.5902405   ++        0.1293077  0.1895810   -         0.1659074  0.2671672   -
ibm                0.0000000  0.0100769   --        0.0000000  0.0055804   -         0.0000000  0.0038138   -
cpu.small          0.9939290  0.9856554   ++        0.8519882  0.8620939             0.9165213  0.9189269
delta.elevators    0.5980803  0.6054747             0.0419848  0.1734476   --        0.0719734  0.2595248   --
cal.housing        0.8425739  0.5854252   ++        0.3140477  0.4403454   --        0.4561053  0.5016178   --
add                0.9320413  0.7925288   ++        0.0000000  0.1496825   --        0.0000000  0.2125705   --
fried.delve        0.8816233  0.5351138   ++        0.0735714  0.0700794             0.0959500  0.0913309



Abstract. One of the most important features of expert reasoning is
that each reasoning rule may be composed of several diagnostic steps,
usually hierarchical differential diagnosis. For example, medical diagnosis
includes hierarchical diagnostic steps. In this paper, the characteristics of
experts’ rules are closely examined from the viewpoint of hierarchical
decision steps and a new approach to extract plausible rules is introduced,
which consists of the following three procedures. First, the characterization
of decision attributes (given classes) is extracted from databases and
the concept hierarchy for given classes is calculated. Second, based on the
hierarchy, rules for each hierarchical level are induced from data. Then,
for each given class, rules for all the hierarchical levels are integrated into
one rule. The proposed method was evaluated on medical databases, the
experimental results of which show that induced rules correctly represent
experts’ decision processes.



1 Introduction 

One of the most important problems in data mining is that extracted rules are
not easy for domain experts to interpret. One reason is that conventional
rule induction methods [7] cannot extract rules that plausibly represent
experts’ decision processes [9]: the description length of induced rules is too
short compared with the experts’ rules. For example, rule induction methods,
including AQ15 [4] and PRIMEROSE [9], induce the following common rule for
muscle contraction headache from databases on differential diagnosis of headache:

[location = whole] $\wedge$ [Jolt Headache = no] $\wedge$ [Tenderness of M1 = yes]
$\rightarrow$ muscle contraction headache.

This rule is shorter than the following rule given by medical experts.

[Jolt Headache = no]
$\wedge$ ([Tenderness of M0 = yes] $\vee$ [Tenderness of M1 = yes] $\vee$ [Tenderness of M2 = yes])
$\wedge$ [Tenderness of B1 = no] $\wedge$ [Tenderness of B2 = no] $\wedge$ [Tenderness of B3 = no]
$\wedge$ [Tenderness of C1 = no] $\wedge$ [Tenderness of C2 = no] $\wedge$ [Tenderness of C3 = no]
$\wedge$ [Tenderness of C4 = no] $\rightarrow$ muscle contraction headache

where [Tenderness of B1 = no] and [Tenderness of C1 = no] are added.



N. Lavrac et al. (Eds.): PKDD 2003, LNAI 2838, pp. 459-470, 2003.

These results suggest that conventional rule induction methods do not reflect
a mechanism of knowledge acquisition of medical experts.

In this paper, the characteristics of experts’ rules are closely examined and
a new approach to extract plausible rules is introduced, which consists of the
following three procedures. First, the characterization of each decision attribute
(a given class), a list of attribute-value pairs the supporting set of which covers
all the samples of the class, is extracted from databases and the classes are
classified into several groups with respect to the characterization. Then, two
kinds of sub-rules, rules discriminating between each group and rules classifying
each class in the group, are induced. Finally, those two parts are integrated into
one rule for each decision attribute.

The paper is organized as follows. Section 2 discusses the background of this
study. Sections 3 and 4 introduce rough sets and a characterization set. Section
5 gives an algorithm for rule induction. Section 6 shows an illustrative example
and Section 7 discusses the results. Finally, Section 8 concludes this paper.

2 Background: Problems with Rule Induction 

As shown in the introduction, rules acquired from medical experts are much
longer than those induced from databases the decision attributes of which are
given by the same experts. This is because rule induction methods generally
search for shorter rules. One of the main reasons why rules are short is that
these patterns are generated by only one criterion, such as high accuracy or high
information gain. The comparative studies [9, 10] suggest that experts acquire
rules not by one criterion alone but by the use of several measures. These
characteristics of medical experts’ rules are fully examined not by comparing
rules for the same class, but by comparing experts’ rules with those
for another class [9]. For example, the classification rule for muscle contraction
headache given in Section 1 is very similar to the following classification rule for
disease of cervical spine:

[Jolt Headache = no]
$\wedge$ ([Tenderness of M0 = yes] $\vee$ [Tenderness of M1 = yes] $\vee$ [Tenderness of M2 = yes])
$\wedge$ ([Tenderness of B1 = yes] $\vee$ [Tenderness of B2 = yes] $\vee$ [Tenderness of B3 = yes]
$\vee$ [Tenderness of C1 = yes] $\vee$ [Tenderness of C2 = yes] $\vee$ [Tenderness of C3 = yes]
$\vee$ [Tenderness of C4 = yes]) $\rightarrow$ disease of cervical spine

The differences between these two rules are the attribute-value pairs from
tenderness of B1 to C4. Thus, these two rules can be simplified into the following form:

$A_1 \wedge A_2 \wedge \neg A_3 \rightarrow$ muscle contraction headache
$A_1 \wedge A_2 \wedge A_3 \rightarrow$ disease of cervical spine,

where $A_1$, $A_2$ and $A_3$ are given as the following formulae:

$A_1$ = [Jolt Headache = no], $A_2$ = [Tenderness of M0 = yes] $\vee$ [Tenderness of
M1 = yes] $\vee$ [Tenderness of M2 = yes], and $A_3$ = [Tenderness of C1 = no] $\wedge$
[Tenderness of C2 = no] $\wedge$ [Tenderness of C3 = no] $\wedge$ [Tenderness of C4 = no].
The first two blocks ($A_1$ and $A_2$) and the third one ($A_3$) represent the different
types of differential diagnosis. The first one, $A_1$, shows the discrimination between
muscular type and vascular type of headache. Then, the second part shows that
between headache caused by neck and head muscles. Finally, the third formula
$A_3$ is used to make a differential diagnosis between muscle contraction headache
and disease of cervical spine. Thus, medical experts first select several diagnostic
candidates, which are very similar to each other, from many diseases and then
make a final diagnosis from those candidates.

This paper formalizes these procedures from the viewpoint of rough sets [5]
and introduces a new approach to rule induction.



3 Rough Set Theory and Probabilistic Rules 

In the following sections, we use the following notations introduced by Grzymala- 
Busse and Skowron[8], which are based on rough set theory [5]. These notations 
are illustrated by a small database shown in Table 1, collecting the patients who 
complained of headache. 

Let $U$ denote a nonempty, finite set called the universe and $A$ denote a
nonempty, finite set of attributes, i.e., $a : U \to V_a$ for $a \in A$, where $V_a$ is called
the domain of $a$. Then, a decision table is defined as an information
system, $IS = (U, A \cup \{d\})$. For example, Table 1 is an information system with
$U = \{1, 2, \dots, 10\}$, $A$ = {location, nature, history, prodrome, jolt headache, nausea, M1, M2} and
$d$ = class. For location $\in A$, $V_{location}$ is defined as {occular, lateral, whole}.

The atomic formulae over $B \subseteq A \cup \{d\}$ and $V$ are expressions of the form
$[a = v]$, called descriptors over $B$, where $a \in B$ and $v \in V_a$. The set $F(B, V)$ of
formulas over $B$ is the least set containing all atomic formulas over $B$ and closed
with respect to disjunction, conjunction and negation. For example, [location =
occular] is a descriptor of $B$.

For each $f \in F(B, V)$, $f_A$ denotes the meaning of $f$ in $A$, i.e., the set of all
objects in $U$ with property $f$, defined inductively as follows.

1. If $f$ is of the form $[a = v]$, then $f_A = \{s \in U \mid a(s) = v\}$.

2. $(f \land g)_A = f_A \cap g_A$; $(f \lor g)_A = f_A \cup g_A$; $(\neg f)_A = U - f_A$.

For example, for $f$ = [location = occular], $f_A = \{1, 5, 6, 7\}$. As an example
of a conjunctive formula, $g$ = [location = occular] $\land$ [nausea = no] is a formula
over $B$ and $g_A$ is equal to $\{1, 5\}$.
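
The inductive definition above can be checked mechanically. The following is a minimal sketch, not part of the original method description: it encodes only the location and nausea columns of Table 1, and the helper names (`descriptor`, `conj`, `disj`, `neg`) are illustrative assumptions.

```python
# Two columns of Table 1 (object number -> value); enough for the example.
LOC = {1: "occular", 2: "whole", 3: "lateral", 4: "lateral", 5: "occular",
       6: "occular", 7: "occular", 8: "whole", 9: "lateral", 10: "whole"}
NAU = {1: 0, 2: 0, 3: 1, 4: 1, 5: 0, 6: 1, 7: 1, 8: 0, 9: 1, 10: 0}
TABLE = {"loc": LOC, "nau": NAU}
U = set(LOC)  # the universe: object numbers 1..10


def descriptor(attr, value):
    """Meaning of [attr = value]: all objects s in U with attr(s) = value."""
    return {s for s in U if TABLE[attr][s] == value}


def conj(f, g):
    """Meaning of (f AND g): intersection of the two meanings."""
    return f & g


def disj(f, g):
    """Meaning of (f OR g): union of the two meanings."""
    return f | g


def neg(f):
    """Meaning of (NOT f): complement with respect to U."""
    return U - f


f_A = descriptor("loc", "occular")
g_A = conj(f_A, descriptor("nau", 0))
print(sorted(f_A))  # [1, 5, 6, 7]
print(sorted(g_A))  # [1, 5]
```

Representing each meaning as a plain Python set keeps the three connectives one-liners, which mirrors the set-theoretic definition directly.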

It is also notable that $d$ can be treated as a formula (or an attribute-value
pair) because $B \subseteq A$ is extended into $B \subseteq A \cup \{d\}$ and $d$ has the same
nature as an attribute $a \in A$: that is, since $d$ is of the form $[d = class_i]$,
$d_A = \{s \in U \mid d(s) = class_i\}$. For simplicity, $d_A$ is denoted by $D$ in subsequent
sections.

By the use of the framework above, classification accuracy and coverage, or
true positive rate, are defined as follows.

Table 1. A small example of a database

No.  loc      nat  his       prod  jolt  nau  M1  M2  class
1    occular  per  per       0     0     0    1   1   m.c.h.
2    whole    per  per       0     0     0    1   1   m.c.h.
3    lateral  thr  par       0     1     1    0   0   common.
4    lateral  thr  par       1     1     1    0   0   classic.
5    occular  per  per       0     0     0    1   1   psycho.
6    occular  per  subacute  0     1     1    0   0   i.m.l.
7    occular  per  acute     0     1     1    0   0   psycho.
8    whole    per  chronic   0     0     0    0   0   i.m.l.
9    lateral  thr  per       0     1     1    0   0   common.
10   whole    per  per       0     0     0    1   1   m.c.h.

Definitions. loc: location, nat: nature, his: history, prod: prodrome,
nau: nausea, jolt: Jolt headache, M1, M2: tenderness of M1 and M2,
1: Yes, 0: No, per: persistent, thr: throbbing, par: paroxysmal,
m.c.h.: muscle contraction headache, psycho.: psychogenic pain,
i.m.l.: intracranial mass lesion, common.: common migraine, and
classic.: classical migraine.



Definition 1.

Let $R$ and $D$ denote a formula in $F(B, V)$ and the meaning of a decision $d$, respectively.
Classification accuracy and coverage (true positive rate) for $R \to d$ are defined as:

$$\alpha_R(D) = \frac{|R_A \cap D|}{|R_A|} \; (= P(D|R)), \qquad \kappa_R(D) = \frac{|R_A \cap D|}{|D|} \; (= P(R|D)),$$

where $|S|$, $\alpha_R(D)$ and $\kappa_R(D)$ denote the cardinality of a set $S$, the classification
accuracy of $R$ as to classification of $D$, and coverage (the true positive rate of $R$ to
$D$), respectively.

In the above example, when $R$ and $D$ are set to [nau = 1] and [class = common],
$\alpha_R(D) = 2/5 = 0.4$ and $\kappa_R(D) = 2/2 = 1.0$.
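
The two measures of Definition 1 can be reproduced on Table 1 with a few lines of code. This is a minimal sketch under the assumption that a formula and a decision are both represented by their meanings (sets of object numbers); the function names are illustrative.

```python
from fractions import Fraction

# Columns "nau" and "class" of Table 1 (object number -> value).
NAU = {1: 0, 2: 0, 3: 1, 4: 1, 5: 0, 6: 1, 7: 1, 8: 0, 9: 1, 10: 0}
CLASS = {1: "m.c.h.", 2: "m.c.h.", 3: "common", 4: "classic", 5: "psycho",
         6: "i.m.l.", 7: "psycho", 8: "i.m.l.", 9: "common", 10: "m.c.h."}


def accuracy(r_a, d):
    """alpha_R(D) = |R_A & D| / |R_A|, i.e. P(D | R)."""
    return Fraction(len(r_a & d), len(r_a))


def coverage(r_a, d):
    """kappa_R(D) = |R_A & D| / |D|, i.e. P(R | D)."""
    return Fraction(len(r_a & d), len(d))


R_A = {s for s, v in NAU.items() if v == 1}         # meaning of [nau = 1]
D = {s for s, c in CLASS.items() if c == "common"}  # meaning of [class = common]

print(accuracy(R_A, D))  # 2/5
print(coverage(R_A, D))  # 1
```

Exact fractions are used instead of floats so that the values can be compared directly with the ratios quoted in the text.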

It is notable that $\alpha_R(D)$ measures the degree of the sufficiency of a proposition,
$R \to D$, and that $\kappa_R(D)$ measures the degree of its necessity. For example,
if $\alpha_R(D)$ is equal to 1.0, then $R \to D$ is true. On the other hand, if $\kappa_R(D)$ is
equal to 1.0, then $D \to R$ is true. Thus, if both measures are 1.0, then $R \leftrightarrow D$.
Finally, we define a partial order of equivalence as follows:

Definition 2. Let $R_i$ and $R_j$ be formulae in $F(B, V)$ and let $A(R_i)$ denote
the set whose elements are the attribute-value pairs of the form $[a, v]$ included in
$R_i$. If $A(R_i) \subseteq A(R_j)$, then we represent this relation as:

$$R_i \preceq R_j.$$

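According to the definitions, probabilistic rules with high accuracy and coverage
are defined as:

$$R \xrightarrow{\alpha, \kappa} d \quad \text{s.t.} \quad R = \bigvee_i R_i = \bigvee \bigwedge_j [a_j = v_k], \;\; \alpha_{R_i}(D) \geq \delta_\alpha \text{ and } \kappa_{R_i}(D) \geq \delta_\kappa,$$

where $\delta_\alpha$ and $\delta_\kappa$ denote given thresholds for accuracy and coverage, respectively.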

4 Characterization Sets

4.1 Characterization Sets

In order to model medical reasoning, a statistical measure, coverage, defined
in Section 3, plays an important role. It is equivalent to the
conditional probability of a condition ($R$) under the decision ($D$): $P(R|D)$. Let
us define a characterization set of $D$, denoted by $L(D)$, as a set each element of
which is an elementary attribute-value pair $R$ whose coverage is at least a
given threshold. That is,

Definition 3. Let $R$ denote a formula in $F(B, V)$. The characterization set of a
target concept ($D$) is defined as:

$$L_{\delta_\kappa}(D) = \{R \mid \kappa_R(D) \geq \delta_\kappa\}$$

Then, three types of relations between characterization sets can be defined as
follows:

Independent type: $L_{\delta_\kappa}(D_i) \cap L_{\delta_\kappa}(D_j) = \emptyset$,

Boundary type: $L_{\delta_\kappa}(D_i) \cap L_{\delta_\kappa}(D_j) \neq \emptyset$, and

Subcategory type: $L_{\delta_\kappa}(D_i) \subseteq L_{\delta_\kappa}(D_j)$.

These three types correspond to the negative region, boundary region, and
positive region, respectively, if the set of all elementary attribute-value
pairs is taken as the universe of discourse.

Tsumoto focuses on the subcategory type in [10] because $D_i$ and $D_j$ cannot be
differentiated by using the characterization set of $D_j$, which suggests that $D_i$ is
a generalized disease of $D_j$. Then, Tsumoto generalizes the above rule induction
method into the overlapped type, considering rough inclusion [11]. However, both
studies assume two-level diagnostic steps: a focusing mechanism and differential
diagnosis, where the former selects diagnostic candidates from all the classes
and the latter makes a differential diagnosis between the focused classes.

The proposed method below extends these methods into multi-level steps. 



4.2 Characteristics 

We consider the special case of characterization sets in which the threshold of
coverage is equal to 1.0. That is,

$$L_{1.0}(D) = \{R_i \mid \kappa_{R_i}(D) = 1.0\}$$

Then, we have several interesting characteristics.

Theorem 1. Let $R_i$ and $R_j$ be two formulae in $L_{1.0}(D)$ such that $R_i \preceq R_j$. Then,
$\alpha_{R_i} \leq \alpha_{R_j}$.

Thus, when we collect the formulae whose values of coverage are equal to 1.0,
the sequence of conjunctive formulae corresponds to an increasing
chain of accuracies.
For example, [nat = per] and [his = per] are elements of $L_{1.0}$(m.c.h.) and
their accuracies are 3/7 and 3/5, respectively. Then, since the meaning of ([loc = occular] $\lor$
[loc = whole]) $\land$ [his = per] is equal to {1, 2, 5, 10}, the accuracy of [nat =
per] $\land$ [his = per] is 3/4.
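
The chain of accuracies in this example can be recomputed from Table 1. The snippet below is only an illustrative sketch (the helper names are assumptions); it checks that both descriptors have coverage 1.0 for m.c.h. and that their conjunction does not decrease accuracy.

```python
from fractions import Fraction

# Columns "nat", "his" and "class" of Table 1 (object number -> value).
NAT = {1: "per", 2: "per", 3: "thr", 4: "thr", 5: "per",
       6: "per", 7: "per", 8: "per", 9: "thr", 10: "per"}
HIS = {1: "per", 2: "per", 3: "par", 4: "par", 5: "per",
       6: "subacute", 7: "acute", 8: "chronic", 9: "per", 10: "per"}
CLASS = {1: "m.c.h.", 2: "m.c.h.", 3: "common", 4: "classic", 5: "psycho",
         6: "i.m.l.", 7: "psycho", 8: "i.m.l.", 9: "common", 10: "m.c.h."}


def meaning(column, value):
    """Objects whose entry in the given column equals value."""
    return {s for s, v in column.items() if v == value}


def accuracy(r_a, d):
    return Fraction(len(r_a & d), len(r_a))


def coverage(r_a, d):
    return Fraction(len(r_a & d), len(d))


D = meaning(CLASS, "m.c.h.")
nat_per = meaning(NAT, "per")
his_per = meaning(HIS, "per")
both = nat_per & his_per  # meaning of [nat = per] AND [his = per]

for name, r_a in [("nat=per", nat_per), ("his=per", his_per), ("both", both)]:
    print(name, sorted(r_a), accuracy(r_a, D), coverage(r_a, D))
```

Every formula in the chain keeps coverage 1.0 while the accuracy rises from 3/7 and 3/5 to 3/4, which is the behaviour stated in Theorem 1.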

Since $\kappa_R(D) = 1.0$ means that the meaning of $R$ covers all the samples of $D$,
its complement $U - R_A$, that is, $\neg R$, does not cover any samples of $D$. In particular,
when $R$ consists of formulae over the same attribute, it can be viewed as
the generation of the coarsest partition. Thus,

Theorem 2. Let $R$ be a formula in $L_{1.0}(D)$ such that $R = \bigvee_j [a_i = v_j]$. Then,
$R$ and $\neg R$ give the coarsest partition for $a_i$, whose $R$ includes $D$.

From Theorems 1 and 2, the next theorem holds.

Theorem 3. Let $A$ consist of $\{a_1, a_2, \dots, a_n\}$ and $R_i$ be a formula in $L_{1.0}(D)$
such that $R_i = \bigvee_j [a_i = v_j]$. Then, the sequence of conjunctive formulae $F(k) = \bigwedge_{i=1}^{k} R_i$
gives a sequence which increases the accuracy. □

5 Rule Induction with Grouping 

As discussed in Section 3, when the coverage of $R$ for a target concept $D$ is
equal to 1.0, $R$ is a necessary condition of $D$. That is, the proposition $D \to R$ holds
and so does its contrapositive $\neg R \to \neg D$. Thus, if $R$ is not observed, $D$ cannot
be a candidate for the target concept. Hence, if two target concepts have a common
formula $R$ whose coverage is equal to 1.0, then $\neg R$ supports the negation of
both concepts, which means that these two concepts belong to the same group.
Furthermore, if two target concepts have similar formulae $R_i, R_j \in L_{1.0}(D)$, they
are very close to each other with respect to the negation of the two concepts. In
this case, the attribute-value pairs in the intersection of $L_{1.0}(D_i)$ and $L_{1.0}(D_j)$
give a characterization set of the concept $D_k$ that unifies $D_i$ and $D_j$. Then,
by comparing $D_k$ with the other target concepts, classification rules for $D_k$ can be
obtained. When we have a sequence of groupings, the classification rules for a given
target concept are defined as a sequence of subrules. From these ideas, a rule
induction algorithm with grouping of target concepts can be described as in Figure 1.
This algorithm first calculates $L_{1.0}(D_i)$ for $\{D_1, D_2, \dots, D_k\}$. Second, from the
list of characterization sets, it calculates the intersection between $L_{1.0}(D_i)$ and
$L_{1.0}(D_j)$ and stores it in $L_{id}$. Third, the procedure calculates the similarity
(matching number) of the intersections and sorts $L_{id}$ with respect to the similarities.
Fourth, the algorithm chooses one intersection ($D_i \cap D_j$) with maximum
similarity (highest matching number) and groups $D_i$ and $D_j$ into a concept $DD_i$.
These procedures are continued until all the groupings are considered (Fig. 2).
Finally, rules for the generated groups and diseases are induced by using the rule
induction algorithm shown in Fig. 3.

procedure Total Process;
var inputs
  Ld : List; /* A list of Target Concepts */
begin
  Calculate a set of characterization sets Lc;
  Calculate a set of intersections Lid;
  Calculate a list of similarity measures Ls;
  Calculate a list of grouping Lg; (Fig. 2)
  Induce a set of rules for Lg: Lr; (Fig. 3)
  Combine rules in Lr for each Di;
end {Total Process}

Fig. 1. An Algorithm for Total Process


procedure Grouping;
var inputs
  Lc : List;  /* A list of Characterization Sets */
  Lid : List; /* A list of Intersection */
  Ls : List;  /* A list of Similarity */
var outputs
  Lgr : List; /* A list of Grouping */
var
  k : integer; Lg, Lgr : List;
begin
  Lg := {};
  k := n; /* n: A number of Target Concepts */
  Sort Ls with respect to similarities;
  Take a set of (Di, Dj), Lmax, with maximum similarity values;
  k := k + 1;
  forall (Di, Dj) ∈ Lmax do
  begin
    Group Di and Dj into Dk;
    Lc := Lc - {(Di, L1.0(Di))};
    Lc := Lc - {(Dj, L1.0(Dj))};
    Lc := Lc + {(Dk, L1.0(Dk))};
    Update Lid for DDk;
    Update Ls;
    Lgr := (Grouping for Lc, Lid, and Ls);
    Lg := Lg + {{(Dk, Di, Dj), Lgr}};
  end
  return Lg;
end {Grouping}

Fig. 2. An Algorithm for Grouping


procedure Rule Induction;
var inputs
  Lc : List;  /* A list of Characterization Sets */
  Lid : List; /* A list of Intersection */
  Lg : List;  /* A list of grouping */
              /* {{(Dn+1, Di, Dj), {(DDn+2, .) ...}}} */
              /* n: A number of Target Concepts */
var
  Q, Lr : List;
begin
  Q := Lg; Lr := {};
  if (Q ≠ ∅) then do
  begin
    Q := Q - first(Q);
    Lr := Rule Induction (Lc, Lid, Q);
  end
  (DDk, Di, Dj) := first(Q);
  if (Di ∈ Lc and Dj ∈ Lc) then do
  begin
    Induce a rule r which discriminates between Di and Dj;
    r = {Ri → Di, Rj → Dj};
  end
  else do
  begin
    Search for L1.0(Di) from Lc;
    Search for L1.0(Dj) from Lc;
    if (i < j) then do
    begin
      r(Di) := ∨ {¬Rl : Rl ∈ L1.0(Dj)} → ¬Dj;
      r(Dj) := ∧ {Rl : Rl ∈ L1.0(Dj)} → Dj;
    end
    r := {r(Di), r(Dj)};
  end
  return Lr := {r, Lr};
end {Rule Induction}

Fig. 3. An Algorithm for Rule Induction
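
The first steps of the total process (characterization sets, pairwise intersections, matching numbers and the choice of the first pair to group) can be sketched on Table 1 as follows. This is an illustrative sketch, not the PRIMEROSE implementation. It assumes that an element of $L_{1.0}(D)$ is, for each attribute, the disjunction of the values observed in $D$, kept only when it excludes at least one value of that attribute's domain, and that two elements match only when they are identical. All function and variable names are hypothetical.

```python
from itertools import combinations

COLS = ["loc", "nat", "his", "prod", "jolt", "nau", "M1", "M2"]
# Table 1: object number -> (attribute values..., class)
ROWS = {
    1: ("occular", "per", "per", 0, 0, 0, 1, 1, "m.c.h."),
    2: ("whole", "per", "per", 0, 0, 0, 1, 1, "m.c.h."),
    3: ("lateral", "thr", "par", 0, 1, 1, 0, 0, "common"),
    4: ("lateral", "thr", "par", 1, 1, 1, 0, 0, "classic"),
    5: ("occular", "per", "per", 0, 0, 0, 1, 1, "psycho"),
    6: ("occular", "per", "subacute", 0, 1, 1, 0, 0, "i.m.l."),
    7: ("occular", "per", "acute", 0, 1, 1, 0, 0, "psycho"),
    8: ("whole", "per", "chronic", 0, 0, 0, 0, 0, "i.m.l."),
    9: ("lateral", "thr", "per", 0, 1, 1, 0, 0, "common"),
    10: ("whole", "per", "per", 0, 0, 0, 1, 1, "m.c.h."),
}


def char_set(cls):
    """Elements (attribute, value set) with coverage 1.0 for class cls."""
    members = [r for r in ROWS.values() if r[-1] == cls]
    result = set()
    for i, attr in enumerate(COLS):
        domain = {r[i] for r in ROWS.values()}
        seen = frozenset(r[i] for r in members)
        if seen < domain:  # proper subset: the disjunction is informative
            result.add((attr, seen))
    return result


classes = sorted({r[-1] for r in ROWS.values()})
L = {c: char_set(c) for c in classes}

# Matching number = cardinality of the intersection of two characterization sets.
matching = {(a, b): len(L[a] & L[b]) for a, b in combinations(classes, 2)}
best = max(matching, key=matching.get)
print(best, matching[best])  # ('classic', 'common') 6
```

Under these assumptions the pair with the highest matching number is common and classic, which agrees with the first grouping step of the example in Section 6. Because the sketch derives every set directly from Table 1, it makes no claim about the remaining entries of the figures.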

6 Example

Let us consider Table 1 as an example for rule induction. For a similarity function,
we use a matching number [3], which is defined as the cardinality of the
intersection of the two sets. Also, since Table 1 has five classes, $k$ is set to 6.

6.1 Grouping

From this table, the characterization set for each concept is obtained as shown in
Fig. 4. Then, the intersection between each pair of target concepts is calculated. Since
common and classic have the maximum matching number, these two classes are
grouped into one category, $D_6$. Then, the characterization of $D_6$ is obtained as
$D_6$ = {[loc = lateral], [nat = thr], [jolt = 1], [nau = 1], [M1 = 0], [M2 = 0]}
from Fig. 5.

In the second iteration, the intersection of $D_6$ and the others is considered as
shown in Fig. 6. From these intersections, we have two possibilities of grouping: one
is to group m.c.h. and i.m.l. That is, these two diseases are grouped into $D_7$:
$D_7$ = {([loc = occular] $\lor$ [loc = whole]), [nat = per], [prod = 0]}. The other one
is to group $D_6$ and i.m.l., where $D_7$ = {[jolt = 1], [M1 = 0], [M2 = 0]}.

In the third iteration of the former case (3a), the intersection is calculated as in
Fig. 7, and $D_7$ and psycho are grouped into $D_8$: $D_{8a}$ = {[nat = per], [prod = 0]}. In
the latter case (3b), it is calculated as in Fig. 8, and m.c.h. and psycho are grouped
into $D_8$: $D_{8b}$ = {[nat = per], [prod = 0]}. Figs. 9 and 10 depict the two results of
grouping like a dendrogram in clustering analysis [3].



$L_{1.0}$(m.c.h.) = {([loc = occular] $\lor$ [loc = whole]), [nat = per], [his = per],
    [prod = 0], [jolt = 0], [nau = 0], [M1 = 1], [M2 = 1]}
$L_{1.0}$(common) = {[loc = lateral], [nat = thr], ([his = per] $\lor$ [his = par]), [prod = 0],
    [jolt = 1], [nau = 1], [M1 = 0], [M2 = 0]}
$L_{1.0}$(classic) = {[loc = lateral], [nat = thr], [his = par], [prod = 1],
    [jolt = 1], [nau = 1], [M1 = 0], [M2 = 0]}
$L_{1.0}$(i.m.l.) = {([loc = occular] $\lor$ [loc = whole]), [nat = per],
    ([his = subacute] $\lor$ [his = chronic]), [prod = 0],
    [jolt = 1], [M1 = 1], [M2 = 1]}
$L_{1.0}$(psycho) = {[loc = occular], [nat = per], ([his = per] $\lor$ [his = acute]),
    [prod = 0]}

Fig. 4. Characterization Sets for Table 1



6.2 Rule Induction 

Due to the limitation of space, we focus on rule induction based on the first
model. Figure 9 shows one candidate of the differential diagnosis. For the
differential diagnosis of common, this model first discriminates between $D_6$ (common

m.c.h. $\cap$ common = {[prod=0]}
m.c.h. $\cap$ classic = {}
m.c.h. $\cap$ i.m.l. = {([loc=occular] $\lor$ [loc=whole]), [nat=per], [prod=0]}
m.c.h. $\cap$ psycho = {[nat=per], [prod=0]}
common $\cap$ classic = {[loc=lateral], [nat=thr], [jolt=1], [nau=1], [M1=0], [M2=0]}
common $\cap$ i.m.l. = {[prod=0], [jolt=1], [M1=0], [M2=0]}
common $\cap$ psycho = {[prod=0]}
classic $\cap$ i.m.l. = {[jolt=1], [M1=0], [M2=0]}
classic $\cap$ psycho = {}
i.m.l. $\cap$ psycho = {[nat=per], [prod=0]}

Fig. 5. Intersection of Two Characterization Sets (Step 2)





m.c.h. $\cap$ $D_6$ = {}
m.c.h. $\cap$ i.m.l. = {([loc=occular] $\lor$ [loc=whole]), [nat=per], [prod=0]}
m.c.h. $\cap$ psycho = {[nat=per], [prod=0]}
$D_6$ $\cap$ i.m.l. = {[jolt=1], [M1=0], [M2=0]}
$D_6$ $\cap$ psycho = {}
i.m.l. $\cap$ psycho = {[nat=per], [prod=0]}

Fig. 6. Intersection of Two Characterization Sets after the first Grouping (Step 3)





$D_6$ $\cap$ $D_7$ = {}
$D_6$ $\cap$ psycho = {}
$D_7$ $\cap$ psycho = {[nat=per], [prod=0]}

Fig. 7. Intersection of Two Characterization Sets after the first Grouping (1) (Step 4a)

m.c.h. $\cap$ $D_7$ = {}
m.c.h. $\cap$ psycho = {[nat=per], [prod=0]}
$D_7$ $\cap$ psycho = {}

Fig. 8. Intersection of Two Characterization Sets after the first Grouping (2) (Step 4b)



and classic) and $D_8$ (m.c.h., i.m.l. and psycho). Then, common and classic
within $D_6$ are differentiated. Thus, a classification rule for common is composed
of two subrules: (discrimination between $D_6$ and $D_8$) and (discrimination within
$D_6$). On the other hand, a classification rule for m.c.h. is composed of three
subrules: (discrimination between $D_6$ and $D_8$), (discrimination between $D_7$ and
psycho) and (discrimination within $D_7$).

Let us consider the first case. The first part can be obtained from the
intersection in Figure 7. That is, $D_8 \to$ [nat = per] $\land$ [prod = 0]; $\neg$[nat =
per] $\lor$ $\neg$[prod = 0] $\to \neg D_8$. Then, since from Figure 4 the difference set
between $L_{1.0}$(common) and $L_{1.0}$(classic) is {[prod = 1]}, a classification rule
for common within $D_6$ is: [prod = 0] $\to$ common.

Combining these two parts, the classification rule for common is: ($\neg$[nat =
per] $\lor$ $\neg$[prod = 0]) $\land$ [prod = 0] $\to$ common. After its simplification, the rule is:

$\neg$[nat = per] $\to$ common,

whose accuracy is equal to 2/3. In the same way, the rule for classic is obtained
as:

$\neg$[nat = per] $\land$ [prod = 1] $\to$ classic.

7 Experimental Results 

The above rule induction algorithm was implemented in PRIMEROSE5.0 (Probabilistic
Rule Induction Method based on Rough Sets Ver 5.0), and was applied
to databases on differential diagnosis of headache, meningitis and cerebrovascular
diseases (CVD), whose precise information is given in Table 2. In these experiments,
$\delta_\alpha$ and $\delta_\kappa$ were set to 0.75 and 0.5, respectively. Also, the threshold for
grouping was set to 0.8.¹ This system was compared with PRIMEROSE4.5 [11],
PRIMEROSE [9], C4.5 [6], CN2 [2] and AQ15 [4] with respect to the following points:
length of rules, similarities between induced rules and experts’ rules, and
performance of rules.

Fig. 9. Grouping by Characterization Sets (First Model): dendrogram over common, classic, m.c.h., i.m.l. and psycho

Fig. 10. Grouping by Characterization Sets (Second Model): dendrogram over the same five classes

In this experiment, the length was measured by the number of attribute-value
pairs used in an induced rule and Jaccard’s coefficient was adopted as a similarity
measure [3]. Concerning the performance of rules, ten-fold cross-validation was
applied to estimate classification accuracy.
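
The text does not spell out how Jaccard's coefficient is applied to a pair of rules. A minimal sketch, assuming each rule is represented by its set of attribute-value pairs (the two example rules below are hypothetical and are not taken from the experiments):

```python
def jaccard(a, b):
    """Jaccard's coefficient |a & b| / |a | b| of two finite sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty rules
    return len(a & b) / len(a | b)


# Hypothetical induced rule and expert rule, as sets of attribute-value pairs.
induced = {("nat", "per"), ("prod", 0), ("jolt", 0)}
expert = {("nat", "per"), ("prod", 0), ("M1", 1)}

print(len(induced))              # rule length: 3 attribute-value pairs
print(jaccard(induced, expert))  # 0.5
```

With this representation, the rule length used in the experiment is simply the size of the set, and a similarity of 1.0 means that the induced rule uses exactly the same attribute-value pairs as the expert's rule.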



Table 2. Information about Databases 

Domain Samples Classes Attributes 
Headache 52119 45 147 

CVD 7620 22 285 

Meningitis 141 4 41 



Table 3 shows the experimental results, which suggest that PRIMEROSE5 
outperforms PRIMEROSE4.5 (two-level) and the other four rule induction meth- 
ods and induces rules very similar to medical experts’ ones. 

8 Discussion 

Readers may wonder why lengthy rules perform better than short rules, since
lengthy rules suffer from overfitting to the given data. One reason is that a decision

¹ These values were given by medical experts as good thresholds for rules in these three
domains.

Table 3. Experimental Results

Method         Length       Similarity    Accuracy
Headache
PRIMEROSE5.0   8.8 ± 0.27   0.95 ± 0.08   95.2 ± 2.7%
PRIMEROSE4.5   7.3 ± 0.35   0.74 ± 0.05   88.3 ± 3.6%
Experts        9.1 ± 0.33   1.00 ± 0.00   98.0 ± 1.9%
PRIMEROSE      5.3 ± 0.35   0.54 ± 0.05   88.3 ± 3.6%
C4.5           4.9 ± 0.39   0.53 ± 0.10   85.8 ± 1.9%
CN2            4.8 ± 0.34   0.51 ± 0.08   87.0 ± 3.1%
AQ15           4.7 ± 0.35   0.51 ± 0.09   86.2 ± 2.9%
Meningitis
PRIMEROSE5.0   2.6 ± 0.19   0.91 ± 0.08   82.0 ± 3.7%
PRIMEROSE4.5   2.8 ± 0.45   0.72 ± 0.25   81.1 ± 2.5%
Experts        3.1 ± 0.32   1.00 ± 0.00   85.0 ± 1.9%
PRIMEROSE      1.8 ± 0.45   0.64 ± 0.25   72.1 ± 2.5%
C4.5           1.9 ± 0.47   0.63 ± 0.20   73.8 ± 2.3%
CN2            1.8 ± 0.54   0.62 ± 0.36   75.0 ± 3.5%
AQ15           1.7 ± 0.44   0.65 ± 0.19   74.7 ± 3.3%
CVD
PRIMEROSE5.0   7.6 ± 0.37   0.89 ± 0.05   74.3 ± 3.2%
PRIMEROSE4.5   5.9 ± 0.35   0.71 ± 0.05   72.3 ± 3.1%
Experts        8.5 ± 0.43   1.00 ± 0.00   82.9 ± 2.8%
PRIMEROSE      4.3 ± 0.35   0.69 ± 0.05   74.3 ± 3.1%
C4.5           4.0 ± 0.49   0.65 ± 0.09   69.7 ± 2.9%
CN2            4.1 ± 0.44   0.64 ± 0.10   68.7 ± 3.4%
AQ15           4.2 ± 0.47   0.68 ± 0.08   68.9 ± 2.3%




attribute gives a partition of datasets: since the number of given classes ranges from 4 to
45, some classes have very low support due to the low prevalence of the corresponding
diseases. Thus, a disease with low frequency may not have short
rules under the conventional methods. However, since our method is based not
on accuracy but on coverage, we can support diseases of low frequency. Another
reason is that this method reflects the reasoning style of domain experts. One
of the most important features of medical reasoning is that medical experts
finally select one or two diagnostic candidates from many diseases, which is called a focusing
mechanism. For example, in differential diagnosis of headache, experts choose one
from about 60 diseases. The proposed method models the induction of rules that
incorporate this mechanism, and its experimental evaluation shows that the induced
rules correctly represent medical experts’ rules.

This focusing mechanism is not specific to the medical domain. It can be applied
in any domain in which a few diagnostic conclusions should be selected from many candidates.
For example, fault diagnosis of complicated electronic
devices should focus on which components will cause a functional problem:
the more complicated the devices are, the more sophisticated a focusing mechanism
is required. In such domains, the proposed rule induction method will be useful for
inducing correct rules from datasets.

9 Conclusion

In this paper, the characteristics of experts’ rules are closely examined, and the
empirical results suggest that grouping of diseases is very important to realize
automated acquisition of medical knowledge from clinical databases. Thus, we
focus on the role of coverage in focusing mechanisms and propose an algorithm
for grouping of diseases by using this measure. The above example shows that
rule induction with this grouping generates rules which are similar to medical
experts’ rules, which suggests that our proposed method should capture medical
experts’ reasoning. This research is a preliminary study on a rule induction
method with grouping and it will be a basis for future work to compare the
proposed method with other rule induction methods by using real-world datasets.


Abstract. In this paper, we address the characterization task and we
present a general framework for the characterization of a target set of
objects by means of their own properties, but also the properties of objects
linked to them. According to the kinds of objects, various links can
be considered. For instance, in the case of relational databases, associations
are the straightforward links between pairs of tables. We propose
CaracteriX, a new algorithm for mining characterization rules, and we
show how it can be used on multi-relational and spatial databases.

Keywords: Machine Learning, Inductive Logic Programming, Data 
Mining, Characteristic Rules, Relational Databases, Spatial Databases. 



1 Introduction 

Characterization is a descriptive data mining task which aims at mining concise 
and compact descriptions of a set of objects, called the target set. It consists in 
discovering properties that characterize these objects, taking into account their 
own properties but also properties of the objects linked to them. 

In comparison to classification and discrimination, characterization is inter- 
esting since it does not require negative examples. This is an important feature 
for some real world applications where it is difficult to collect negative examples. 

Several fields have contributed to this task. On the one hand, characterization 
has been treated as descriptive generalization in the field of Machine Learning 
[12]. Characterizing a set of objects has also been considered as computing the 
least general generalization (l.g.g.) in Inductive Logic Programming [14], but 
such an approach leads to complexity problems. An object oriented view for 
computing the l.g.g. called structural matching has been proposed in [8, 17] and 
applied to air traffic control in [9]. On the other hand, in Data Mining, Han 
et al. [7, 6] have introduced attribute oriented induction for data generalization, 
but in their framework, background knowledge such as taxonomies is needed for 
generalizing data, and objects are described in a single table, which limits the
applicability of such a method.

We can also consider that characterization is close to the task of mining
frequent properties on the target set. This task has already long been studied
[1, 11, 5, 16], since in many systems, it is the first step for mining association
rules. Nevertheless, most works suppose that data is stored in a single table,
and few algorithms [3] really handle multi-relational databases. Moreover, the
frequency (also called the support) is not sufficient to characterize the objects of
the target set, because it is also important to determine whether a property is
truly a characteristic feature by considering also the frequency of that property
outside the target set.

The approach we propose handles multi-relational databases taking into account
the structure of the database. It relies on the definition of a Quantified
Path, which is an expression that specifies how to take into account different
kinds of objects and their relationships, starting from the target objects. For
instance, considering as a target set the set of films produced by a given person
Sp and denoted by $Movie_{(Sp)}$, the following expression:

$Movie_{(Sp)} : \exists Award :: Award.kind \in \{Oscar, GoldenPalm\}$

is a characteristic rule which means that each movie produced by Sp has received
at least one Oscar award or Golden Palm award. The expression $Movie_{(Sp)} : \exists Award$
is a quantified path. It specifies that we are interested in the properties
satisfied by at least one award received by Sp’s movies. On the other hand,
considering the Quantified Path $Movie_{(Sp)} : \forall Award$ means that we are looking
for properties satisfied by all the awards received by all Sp’s movies.

At LIFO, we have developed CaracteriX, a levelwise¹ algorithm for mining
interesting characteristic rules. It starts with the most general Quantified Paths
and explores the search space according to a notion of generality between rules.
Moreover, it uses two heuristics, link-coverage and open-coverage, to efficiently prune
the search space. Another important feature of our approach is the form of the
rules, which relies on quantified paths defining how to ’navigate’ between sets
of objects. As far as we know, the form of rules we have introduced has not yet
been used in that field.

The paper is organized as follows. Section 2 formalizes the problem of mining 
characteristic rules. In Section 3, we give definitions on which our approach relies: 
the notion of quantified paths, properties and characteristic rules, the notion of 
coverage and generality orders. Section 4 is devoted to the general algorithm and 
Section 5 to experiments. 



2 Problem Statement 

The characterization task we are interested in can be formulated as follows: 

- given a set of types $T_i$, and attributes for describing objects of type $T_i$,

- given a set $\mathcal{E}$ of objects, $\mathcal{E} = \mathcal{E}_1 \cup \mathcal{E}_2 \cup \dots \cup \mathcal{E}_n$, where each $\mathcal{E}_i$ contains objects
with the same type $T_i$,

- given a set $\mathcal{R}$ of binary relations (in the following, each relation of $\mathcal{R}$ is a binary
relation on some $\mathcal{E}_i \times \mathcal{E}_j$),

- given a target set $\mathcal{E}_{target}$, such that there exists $i$ with $\mathcal{E}_{target} \subseteq \mathcal{E}_i$,

$\to$ find a set of characterization rules of $\mathcal{E}_{target}$.

¹ See [11, 13, 15] for a description of the levelwise algorithms family.

The size of the search space for the characterization rules depends, among others,
on the number of relations in $\mathcal{R}$ and on their cardinalities. Without restrictions
on the possible forms of the rule, the search space may become so large that the
learning task is intractable.

Example 1. Application to relational databases 




Fig. 1. Movies database 



Our approach is illustrated throughout this paper by a running example,
Movies², given in Figure 1. This database is stored in a relational form composed
of several files. There is information on actors, casts, directors, producers, stu- 
dios,... The main file Movie is a list of movies described by their category, title, 
year, process, and so on. The actors are listed with their roles in another file 
Casts. More information about individual actors such as name, date of birth, 
gender and origin can be found in the file Actors. The file People gives more in- 
formation about actors, directors, producers, writers, and cinematographers. Re- 
makes links movies to their remakes, whereas Awards gives the different awards 
that can be won by a movie. Finally, Studios provides some information about 
each studio, such as the location and the founder. 

For instance, we could be interested in characterizing the properties of comic
movies, or the properties of movies produced by a given producer, and so on.



² Inspired by http://kdd.ics.uci.edu/databases/movies/movies.html

3 General Framework

3.1 Quantified Path

Definition 1. A Quantified Path (denoted in the following by $\mathcal{QP}$) on $X_0$ is
a formula:

$$Q_1 X_1 \dots Q_n X_n$$

where $n \geq 0$, $X_0$ represents the target objects, and for each $i \neq 0$, $Q_i = \forall$ or $\exists$,
$X_i$ is a type of objects, and there exists a relationship in $\mathcal{R}$ between $X_{i-1}$ and
$X_i$. When necessary, in order to remember the target set, it will be prefixed by $X_0$,
leading to $X_0 : Q_1 X_1 \dots Q_n X_n$.

Let us notice that when there exist several relationships between $X_{i-1}$ and $X_i$,
the quantifier $Q_i$ may be indexed by the relation used in the $\mathcal{QP}$.

A $\mathcal{QP}$ has a size $n$ that is the number of its quantifiers.

Example 2. • Links between movies (M) and awards (W) give two paths denoted
by $M : \forall W$ and $M : \exists W$. $M : \forall W$ means "all awards of each movie", while
$M : \exists W$ stands for "at least one award of each movie".

• $P_{name=Hit} : \forall M \forall W$ is another path, where $P_{name=Hit}$ is a target set of people
(P). This path means that we are interested in all awards of all Hit’s movies.

Definition 2. We say that two quantified paths are variants if they have the 
same size, if they involve the same type of objects, the same relations in the 
same order and if they differ by at least a quantifier. 

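Example 3. If we consider people (P) as a target set and links between people
and movies (M), we have the four following paths: $P : \forall M \forall W$, $P : \forall M \exists W$,
$P : \exists M \exists W$, $P : \exists M \forall W$. These $\mathcal{QP}$s are variants of size 2.

Definition 3. We say that a quantified path $\delta_1$ is more general than a quantified
path $\delta_2$ (denoted by $\delta_1 \succeq \delta_2$) iff $\delta_1$ and $\delta_2$ are variants and for $1 \leq i \leq size(\delta_1) (= size(\delta_2))$, either:

- $Q_i^1 = Q_i^2$ or
- $Q_i^1 = \exists$ and $Q_i^2 = \forall$.

Example 4. For instance, we have $P : \exists M \exists W \succeq P : \forall M \exists W \succeq P : \forall M \forall W$ and
also $P : \exists M \exists W \succeq P : \exists M \forall W \succeq P : \forall M \forall W$, but $P : \forall M \exists W \not\succeq P : \exists M \forall W$ and
$P : \exists M \forall W \not\succeq P : \forall M \exists W$.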
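
The generality order of Definition 3 is easy to test programmatically. The sketch below is an illustration only: it assumes a quantified path is encoded as a list of (quantifier, type) pairs with "E" for $\exists$ and "A" for $\forall$, and it treats a path as comparable with itself (a reflexive reading, which is an assumption). The function name `more_general` is hypothetical.

```python
def more_general(d1, d2):
    """True if path d1 is at least as general as d2.

    Both paths must traverse the same object types in the same order; at
    every position the quantifiers are equal, or d1 has "E" where d2 has "A".
    """
    if len(d1) != len(d2):
        return False
    for (q1, t1), (q2, t2) in zip(d1, d2):
        if t1 != t2:
            return False
        if not (q1 == q2 or (q1 == "E" and q2 == "A")):
            return False
    return True


EE = [("E", "M"), ("E", "W")]  # P : exists M exists W
AE = [("A", "M"), ("E", "W")]  # P : forall M exists W
EA = [("E", "M"), ("A", "W")]  # P : exists M forall W
AA = [("A", "M"), ("A", "W")]  # P : forall M forall W

print(more_general(EE, AE), more_general(AE, AA))  # True True
print(more_general(AE, EA), more_general(EA, AE))  # False False
```

The last line shows that the order is only partial: the two mixed paths are incomparable, as stated in Example 4.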

3.2 Properties 

A set of properties is associated to each type of objects. We consider many kinds
of properties such as: attribute = value, attribute $\in$ {value$_1$, ..., value$_n$}, attribute
> value, attribute < value, and even aggregates such as: count, min, max, ... For
a type $T$ and a property $p$ on $T$, we assume that there exists a boolean function
$V_p$, such that for each object $o$ of type $T$, $V_p(o) = true$ or $V_p(o) = false$. It
means that a property may be satisfied by an object $o$ or not.

Definition 4. We define two basic properties True and False such that for any
object $o$, $V_{True}(o) = true$ and $V_{False}(o) = false$.

Definition 5. We say that a property $p_1$ is more general than a property $p_2$
(denoted by $p_1 \succeq p_2$) iff all objects that satisfy the property $p_2$ also verify the
property $p_1$.

Example 5. The property $W.kind \in \{Oscar, GoldenPalm\}$, where $W$ represents
the set of awards, is more general than $W.kind \in \{GoldenPalm\}$.
3.3 Characteristic Rules 

Definition 6. We define a characteristic rule on a target set $X_0$ as the
conjunction of a quantified path $\delta$ and a property $p$, denoted by: $X_0 : \delta :: p$.

Definition 7. We say that two characteristic rules $r_1$ ($T : \delta_1 :: p_1$) and $r_2$ ($T :
\delta_2 :: p_2$) are variants if $\delta_1$ and $\delta_2$ are variants and $p_1 = p_2$.

Example 6. $P_{name=Hit} : \forall M :: M.category = Suspense$

is a characteristic rule, where $P_{name=Hit}$ is a target set of People whose name is
Hit. This rule means that all Hit’s movies belong to the Suspense category.



3.4 Coverage 

The notion of coverage is defined for a property p relative to a quantified
path $\delta$. It measures the number of objects that have this property. For a rule
$r = X_0 : \delta :: p$ and an object $o \in X_0$, we define $V_{\delta::p}(o)$ recursively as follows:

- $V_{\forall X.\delta'::p}(o) = V_{\delta'::p}(o_1) \wedge \dots \wedge V_{\delta'::p}(o_n)$, or false if there is no object linked to o

- $V_{\exists X.\delta'::p}(o) = V_{\delta'::p}(o_1) \vee \dots \vee V_{\delta'::p}(o_n)$, or false if there is no object linked to o

- $V_{\delta^{\emptyset}::p}(o) = V_p(o)$, that is, true if o has the property p, false otherwise,

where $o_1, \dots, o_n$ are the objects of type X linked to the object o, and $\delta^{\emptyset}$ is the
empty path (size 0).

Example 7. Let us consider the rule $P_d : \forall M \exists W ::$ W.kind $\in$ {Oscar, GoldenPalm},
where $P_d$ denotes the directors in the relation people.

$V_{\forall M \exists W :: W.kind \in \{Oscar, GoldenPalm\}}(Sp) = V_{\exists W :: W.kind \in \{Oscar, GoldenPalm\}}(film_1) \wedge \dots \wedge V_{\exists W :: W.kind \in \{Oscar, GoldenPalm\}}(film_m)$

where $film_1, \dots, film_m$ denote the movies directed by Sp.



Definition 8. Coverage is given by the following:

$$coverage(r, \mathcal{E}_{target}) = \frac{|\{o \mid o \in \mathcal{E}_{target} \text{ and } V_r(o) = true\}|}{|\mathcal{E}_{target}|}$$
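
The recursive evaluation of a rule and the coverage ratio of Definition 8 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout (`links` as a dictionary from relation name to adjacency lists), the function names `holds` and `coverage`, and the toy movie data are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Obj = str
Path = List[Tuple[str, str]]              # [(quantifier, relation), ...]; "A" = for all, "E" = exists
Links = Dict[str, Dict[Obj, List[Obj]]]   # relation -> object -> linked objects (assumed layout)


def holds(path: Path, prop: Callable[[Obj], bool], o: Obj, links: Links) -> bool:
    """Evaluate V_{path::prop}(o) following the three recursive cases."""
    if not path:                           # empty path: test the property itself
        return prop(o)
    (quantifier, relation), rest = path[0], path[1:]
    linked = links.get(relation, {}).get(o, [])
    if not linked:                         # no linked object: false for both quantifiers
        return False
    results = (holds(rest, prop, x, links) for x in linked)
    return all(results) if quantifier == "A" else any(results)


def coverage(path: Path, prop: Callable[[Obj], bool], target: Sequence[Obj], links: Links) -> float:
    """Fraction of target objects for which the rule path::prop holds."""
    return sum(holds(path, prop, o, links) for o in target) / len(target)


# Hypothetical toy data: three movies, the third one has no recorded actor.
links = {"Actor": {"m1": ["a1", "a2"], "m2": ["a3"]}}
gender = {"a1": "female", "a2": "male", "a3": "male"}
target = ["m1", "m2", "m3"]

cov_exists_female = coverage([("E", "Actor")], lambda a: gender[a] == "female", target, links)
cov_all_male = coverage([("A", "Actor")], lambda a: gender[a] == "male", target, links)
print(cov_exists_female, cov_all_male)
```

Note that the movie without actors counts as not covered under both quantifiers, as required by the "false if there is no object linked to o" clause.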



Example 8. Let us consider all the movies as the target set. The coverage of the
rule $M : \exists A ::$ A.gender = female is equal to 2526/11404, where 2526 is the number of
movies with female actors and 11404 is the total number of movies. In the same
way, we can calculate coverage($M : \exists A ::$ A.gender = animal, movies).

3.5 Generality Order

Definition 9. We say that a characteristic rule $r_1$ ($\delta_1 :: p_1$) is more general than
a rule $r_2$ ($\delta_2 :: p_2$) (denoted by $r_1 \succeq r_2$) iff $\delta_1 \succeq \delta_2$ and $p_1 \succeq p_2$. We write $r_1 \succ r_2$
when $r_1 \succeq r_2$ and $\neg(r_2 \succeq r_1)$.

Example 9. $M : \exists W ::$ W.kind in {Oscar, GoldenPalm} $\succeq$
$M : \forall W ::$ W.kind in {Oscar}.

Lemma 1. Coverage is monotone with respect to the generality order, i.e.,
if $coverage(r_2, \mathcal{E}_{target}) \ge \epsilon$ and $r_1 \succeq r_2$ then $coverage(r_1, \mathcal{E}_{target}) \ge \epsilon$; or else
if $\neg(coverage(r_1, \mathcal{E}_{target}) \ge \epsilon)$ and $r_1 \succeq r_2$ then $\neg(coverage(r_2, \mathcal{E}_{target}) \ge \epsilon)$.

3.6 Specialization Operator 

Definition 10. We define the specialization operator $\rho$ as a binary relation on
the set of characteristic rules as follows:

$\rho(\delta :: p) = \{\delta' :: p \mid \delta'$ differs from $\delta$ by one $\exists$ quantifier set to $\forall\} \cup \{\delta :: p' \mid p \succ p'$
and there is no $p''$ s.t. $p \succ p'' \succ p'\}$

Let us notice that for all r" G p(r), there is no r' ^ p{r) such that r r' and 
r' ^ r" . 
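
A specialization operator in the spirit of Definition 10 can be sketched as below. The encoding is an assumption made for illustration: a rule is a pair (path, property name), quantifiers are written "E" and "A", and the immediate specializations of each property are supplied by a hypothetical lookup table `direct_refinements` standing in for the property order.

```python
from typing import Dict, List, Tuple

Path = Tuple[Tuple[str, str], ...]   # ((quantifier, object_type), ...), quantifier in {"E", "A"}
Rule = Tuple[Path, str]              # (quantified path, property name)


def specialize(rule: Rule, direct_refinements: Dict[str, List[str]]) -> List[Rule]:
    """Return the immediate specializations of a rule."""
    path, prop = rule
    children: List[Rule] = []
    # (1) Turn exactly one existential quantifier into a universal one.
    for i, (quantifier, obj_type) in enumerate(path):
        if quantifier == "E":
            children.append((path[:i] + (("A", obj_type),) + path[i + 1:], prop))
    # (2) Replace the property by one of its immediate specializations.
    for finer in direct_refinements.get(prop, []):
        children.append((path, finer))
    return children


# Hypothetical property order: True is refined by the two gender tests.
refinements = {"True": ["Actor.gender = male", "Actor.gender = female"]}
start: Rule = ((("E", "Actor"),), "True")
children = specialize(start, refinements)
print(children)
```

Applied repeatedly from the rule with an existential quantifier and the property True, this generates a search space of the kind described in the following example.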

Example 10. Suppose that we consider only the following properties for Actors:
{Actor.gender = male, Actor.gender = female}, and Movies as the target set.
The complete search space starting with $\exists A ::$ True is given in Figure 2.




Fig. 2. Search space starting with the rule Movies: $\exists A ::$ True



The definition of a specialization operator allows us to define a top-down,
levelwise search strategy for mining characteristic rules. For pruning the search
space, we define two notions: open-coverage and link-coverage.

3.7 Link-Coverage

We define link-coverage($\delta :: p$, $\mathcal{E}_{target}$) = coverage($open(\delta) ::$ True, $\mathcal{E}_{target}$). Intuitively,
link-coverage measures the number of target objects for which there exists at
least one object linked to them through $\delta$. This can be useful when there is a 0..*
relation, which means that some objects may be linked to no object by this
relation.



3.8 Open-Coverage

We define open-coverage($\delta :: p$, $\mathcal{E}_{target}$) = coverage($open(\delta) :: p$, $\mathcal{E}_{target}$), where $open(\delta)$
is obtained by setting all the quantifiers of $\delta$ to $\exists$. Intuitively, open-coverage
counts the number of target objects for which there is at least one object linked
to them by $\delta$ and satisfying p.



3.9 Interesting Characteristic Rules 

For a rule $\delta :: p$, coverage measures the number of objects in the target set
having the property p. We would like to estimate whether this property is really
characteristic of $\mathcal{E}_{target}$ or not. This can be achieved by verifying that the property
covers enough objects in the target set, while covering few objects outside the
target set. One should find a trade-off between these two conditions and estimate
the quality of rules.

Furthermore, in descriptive data mining tasks such as characterization, thousands
of rules may be discovered, which makes rule filtering a necessary
post-processing step. In our framework, we define a function named Interesting
that can filter the rules relying on such heuristics in order to keep only
interesting ones. In [10], Lavrac et al. analyze some rule evaluation measures used
in Machine Learning and Knowledge Discovery. They propose a single measure
that can be considered as a measure of novelty, precision, accuracy, negative
reliability, or sensitivity. In our experiments, we used their novelty measure: the
novelty of a rule $H \leftarrow B$ is given by (P represents a probability):

$$Novelty(H \leftarrow B) = P(HB) - P(H) \cdot P(B)$$

For a characteristic rule r and each object $o \in \mathcal{E}$, we can consider the facts
$o \in \mathcal{E}_{target}$ and $V_r(o) = true$. We are looking for a strong association between these
two facts, which can be estimated by the novelty measure. In our framework,
the novelty of a rule can be estimated by:

$$Novelty(r) = \frac{|\{o \mid o \in \mathcal{E}_{target} \text{ and } V_r(o) = true\}|}{|\mathcal{E}|} - \frac{|\mathcal{E}_{target}|}{|\mathcal{E}|} \cdot \frac{|\{o \mid o \in \mathcal{E} \text{ and } V_r(o) = true\}|}{|\mathcal{E}|}$$

According to [10], we have $-0.25 \le Novelty(r) \le 0.25$. A strongly positive
value indicates a strong association between the two facts.

Function Interesting(r, $\mathcal{E}_{target}$): boolean
  If Novelty(r) $\ge$ 0.25 then return True
  else return False



We can also use other measures such as entropy, purity, or the Laplace estimate; see
[4] for more details. In addition to novelty, we used in our experiments the
Laplace estimate, given by:

$$Laplace(r) = \frac{coverage(r, \mathcal{E}_{target}) + 1}{coverage(r, \mathcal{E}_{target}) + coverage(r, \mathcal{E} - \mathcal{E}_{target}) + 2}$$

We have 0 < Laplace(r) < 1. If a rule covers no examples, then Laplace is equal to 0.5.
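
Both measures are easy to compute once the set of covered objects is known. The sketch below is illustrative only: the function names are hypothetical, `holds` stands for the rule test $V_r$, and the Laplace estimate is computed from counts of covered objects, an assumption consistent with the remark that a rule covering no examples gets 0.5.

```python
from typing import Callable, Hashable, Sequence, Set


def novelty(objects: Sequence[Hashable], target: Set[Hashable],
            holds: Callable[[Hashable], bool]) -> float:
    """P(target and covered) - P(target) * P(covered), estimated by frequencies."""
    n = len(objects)
    covered = {o for o in objects if holds(o)}
    return len(covered & target) / n - (len(target) / n) * (len(covered) / n)


def laplace(objects: Sequence[Hashable], target: Set[Hashable],
            holds: Callable[[Hashable], bool]) -> float:
    """(covered targets + 1) / (all covered objects + 2), using counts (assumption)."""
    covered = {o for o in objects if holds(o)}
    positives = len(covered & target)
    negatives = len(covered - target)
    return (positives + 1) / (positives + negatives + 2)


# Hypothetical toy data: ten objects, four of them in the target set,
# and a rule that holds for objects 0, 1 and 2 only.
objects = list(range(10))
target = {0, 1, 2, 3}
rule_holds = lambda o: o < 3

nov = novelty(objects, target, rule_holds)
lap = laplace(objects, target, rule_holds)
print(nov, lap)
```

In this toy case the rule covers only target objects, so both measures are clearly positive; a rule covering the same share inside and outside the target set would have a novelty of zero.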



4 Algorithm 

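We can use a variant of the levelwise algorithm [11] for mining all potentially
interesting characteristic rules.

CaracteriX Algorithm

input $C_1 = \{r \mid$ there is no $r'$ such that $r' \succ r\}$
i = 1
while $C_i \ne \emptyset$
  1. $T_i = \{r \in C_i \mid$ link-coverage$(r, \mathcal{E}_{target}) \ge \epsilon\}$
  2. $T'_i = \{r \in T_i \mid$ open-coverage$(r, \mathcal{E}_{target}) \ge \epsilon\}$
  3. $T''_i = \{r \in T'_i \mid$ coverage$(r, \mathcal{E}_{target}) \ge \epsilon\}$
  4. $C_{i+1} = \bigcup \{\rho(r) \mid r \in T'_i\} \setminus \bigcup_{j \le i} C_j$
  5. i = i + 1
end while
output $\{r \in \bigcup_j T''_j \mid Interesting(r, \mathcal{E}_{target})\}$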

CaracteriX starts with $C_1$, the set of the most general characteristic rules
given by the user. The algorithm then iterates coverage tests (lines 1, 2, 3) and
generation of the next candidate rules (line 4), taking care to discard previously
considered rules. The iteration stops when it is not possible to generate further
candidate rules. The pruning heuristics, link-coverage (line 1) and open-coverage
(line 2), are used to reduce the number of coverage evaluations done in line 3.
Open-coverage and, a fortiori, link-coverage are the same for variant rules. They are
stored and retrieved as needed to avoid unnecessary computations. Let us notice
that these pruning strategies only exclude characteristic rules that do not fulfill
the minimum coverage requirement $\epsilon$. The algorithm then outputs the set of all
interesting rules.

Lemma 2. CaracteriX is correct and complete w.r.t. $C_1$.

Proof. The proof relies on the following inequality: link-coverage$(r, \mathcal{E}_{target}) \ge$
open-coverage$(r, \mathcal{E}_{target}) \ge$ coverage$(r, \mathcal{E}_{target})$.
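
The levelwise loop can be sketched generically in Python. This is not the authors' code: the three coverage measures, the specialization operator and the interestingness test are passed in as functions, and the toy rules (strings of quantifiers with made-up coverage values) are purely hypothetical. Following the listing, new candidates are generated from the rules that pass the open-coverage test.

```python
from typing import Callable, Hashable, Iterable, List, Set

Rule = Hashable


def caracterix(
    most_general: Iterable[Rule],
    specialize: Callable[[Rule], Iterable[Rule]],
    link_coverage: Callable[[Rule], float],
    open_coverage: Callable[[Rule], float],
    coverage: Callable[[Rule], float],
    interesting: Callable[[Rule], bool],
    epsilon: float,
) -> List[Rule]:
    candidates: Set[Rule] = set(most_general)
    seen: Set[Rule] = set()
    covered: List[Rule] = []
    while candidates:
        seen |= candidates
        # Cheap filters first (lines 1 and 2), full coverage test last (line 3).
        t1 = [r for r in candidates if link_coverage(r) >= epsilon]
        t2 = [r for r in t1 if open_coverage(r) >= epsilon]
        t3 = [r for r in t2 if coverage(r) >= epsilon]
        covered.extend(t3)
        # Line 4: specialize surviving rules and drop rules already considered.
        candidates = {child for r in t2 for child in specialize(r)} - seen
    return [r for r in covered if interesting(r)]


# Hypothetical toy search space: rules are strings of quantifiers over a fixed path.
toy_coverage = {"EE": 0.9, "AE": 0.6, "EA": 0.5, "AA": 0.2}


def flip_one(rule: str) -> List[str]:
    """Set one existential quantifier to universal."""
    return [rule[:i] + "A" + rule[i + 1:] for i, q in enumerate(rule) if q == "E"]


result = caracterix(
    ["EE"], flip_one,
    link_coverage=lambda r: 0.9, open_coverage=lambda r: 0.9,
    coverage=toy_coverage.__getitem__, interesting=lambda r: True, epsilon=0.5,
)
print(sorted(result))
```

Because link-coverage and open-coverage are upper bounds of coverage, filtering on them first never discards a rule that would pass the final test.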



5 Experiments 

The model that we have proposed and the system CaracteriX have been developed
by the first three authors at LIFO, and tested on a real geographic
database provided by the BRGM. The rules that have been learned have been
evaluated by an expert geologist (the fourth author of the paper). For this purpose,
we have extended our framework in order to take into account the spatial
dimension, mainly the topological and distance information between geographic
objects. In our experiments we have used a GIS [2], which handles many layers:
geographic, geologic, seismic, volcanic, mineralogic, gravimetric, ... These layers
store more than 70 thousand geographic objects. We aim at finding characterization
rules for mineral ore deposits using geological information,
faults, volcanoes, ... This task can be stated as follows:

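- given a set $\mathcal{E}$ of geographic objects, $\mathcal{E} = \mathcal{E}_1 \cup \mathcal{E}_2 \cup \mathcal{E}_3 \cup \mathcal{E}_4 \cup \mathcal{E}_5$, where $\mathcal{E}_1$ contains
mineral deposits, $\mathcal{E}_2$ represents the geology, $\mathcal{E}_3$ the volcanoes, $\mathcal{E}_4$ the faults and
$\mathcal{E}_5$ the seisms;

- given a set $\mathcal{R}$ of binary relations based on spatial proximity;

- given a target set $\mathcal{E}_{target}$ = {gold mines} $\subset \mathcal{E}_1$;

$\rightarrow$ find a set of characterization rules of {gold mines}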

To take into account the distance between objects, we introduce a parameter
$\lambda$, so that a binary relation between objects in $\mathcal{E}_i$ and objects in
$\mathcal{E}_j$ is parameterized by $\lambda$. In the case of geographic objects, this parameter may
denote the distance between objects. For instance, with $\lambda$ = 100 km we obtain a binary
relation between mineral deposits and volcanoes at a distance less than or equal to
100 km. As a consequence, the notion of quantified path described in Section
3.1 has been extended, considering the parameter $\lambda$ used in binary relations. For
instance, $M : \forall_{10km} F \forall_{5km} V$ denotes all the volcanoes at less than 5 kilometers
from the faults at less than 10 kilometers from each mine. In order to handle distance
information between objects, we construct growing buffers around target objects
progressively, while checking for the properties satisfied by objects entering
the buffers. This notion is illustrated by Figure 3, where buffers are constructed
around mineral deposits.

The quantified path generality order defined in Section 3.1 can be extended
to such parametrized quantified paths. In fact, in the case of characteristic rules
with one parameter, we have $\delta_\lambda \succeq \delta_{\lambda'}$ if ($\lambda \ge \lambda'$ and $\lambda, \lambda'$ index a $\exists$) or
($\lambda \le \lambda'$ and $\lambda, \lambda'$ index a $\forall$). We have:

$M : \forall_{3km} F \succeq M : \forall_{5km} F \succeq M : \forall_{10km} F$
$M : \exists_{10km} F \succeq M : \exists_{5km} F \succeq M : \exists_{3km} F$

Intuitively, this means that if a property holds for all faults at a distance less
than 10 km from a mine, then this property also holds for all faults at less than
5 km and 3 km from this mine. Vice versa, if there exists a fault at less than 3 km
from a mine with a given property, then there exists a fault at less than 5 km
and less than 10 km with the same property.
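
The parameter-based order can be checked mechanically. The sketch below is an illustration under an assumed encoding: quantifiers are written "E" and "A", distances are plain numbers in kilometers, and the function names are hypothetical. A path with several parameters is compared step by step.

```python
from typing import Sequence


def step_more_general(quantifier: str, lam: float, lam_prime: float) -> bool:
    """True when a step with buffer lam is at least as general as one with lam_prime."""
    if quantifier == "E":
        return lam >= lam_prime    # a larger buffer makes "there exists" easier to satisfy
    if quantifier == "A":
        return lam <= lam_prime    # a smaller buffer makes "for all" easier to satisfy
    raise ValueError(f"unknown quantifier: {quantifier}")


def path_more_general(quantifiers: Sequence[str], lams: Sequence[float],
                      lams_prime: Sequence[float]) -> bool:
    """Compare two paths sharing the same quantifiers, one parameter per step."""
    return all(step_more_general(q, l, lp) for q, l, lp in zip(quantifiers, lams, lams_prime))


print(step_more_general("A", 3, 5), step_more_general("E", 10, 5))
print(path_more_general("AE", [3, 10], [5, 5]))
```

Two paths whose parameters move in opposite directions relative to this rule are incomparable, so the resulting order is partial.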

When we have more than one parameter, we can induce a partial order by
taking into account the relation $\delta_{\lambda_1, \dots, \lambda_n} \succeq \delta_{\lambda'_1, \dots, \lambda'_n}$ if $\forall i$, ($\lambda_i \ge \lambda'_i$ and $\lambda_i, \lambda'_i$
index a $\exists$) or ($\lambda_i \le \lambda'_i$ and $\lambda_i, \lambda'_i$ index a $\forall$).

Fig. 3. Buffers around some target points in the GIS. Layers represented here are
geology, mineral deposits, faults and volcanoes



Table 1. Some examples of tested rules

Rule                                               Coverage   Laplace   Novelty
M: M.Era $\in$ {Mesozoic, Cretacious}                 4.59%      0.750     0.0080
M: M.Era $\in$ {Mesozoic, Jurassic, Cretacious}       6.42%      0.148    -0.0133
M: M.Lithology = sedimentary deposits              5.50%      0.070    -0.0413
M: M.Lithology = volcanic deposits                 64.22%     0.266     0.0102
M: M.Distance_Benioff $\in$ [170..175]                66.97%     0.365     0.0529
M: $\exists_{10km}$ G :: G.Age = tertiary                86.24%     0.259     0.0086
M: $\exists_{5km}$ V :: V.Age = recent                   7.34%      0.310     0.0030



5.1 Results 

Our system tested hundreds of rules. Some examples are given in Table 1.

The following rule has been discovered; it covers 60% of gold mines and
rejects most of the other mines.



M : $\exists_{10km}$ G :: M.MainSubstance = au $\wedge$
    G.CodeGeology = TertiaryVolcanic $\wedge$
    M.BenioffDepth $\in$ [75..150] $\wedge$
    M.Distance_Benioff $\in$ [170..275] $\wedge$
    M.BenioffSlope $\in$ [8°..16°] $\wedge$
    G.Age = tertiary $\wedge$
    M.Lithology = volcanic $\wedge$
    M.Gitology = epithermal $\wedge$
    M.Morphology = veins



This rule, considered interesting by experts, expresses that for all gold
mines, there exists a tertiary volcanic geology at a distance less than 10 km
from the mine, and these mines are epithermal ones with a morphology of veins,
at a Benioff depth between 75 and 150 km and a Benioff slope between
8° and 16°. According to expert geologists, this rule is interesting because it is
related to a natural phenomenon: plate tectonics.

Fig. 4. Link-coverage of the rule $M : \exists_A F \exists_B V ::$ True

Figure 4 illustrates the notion of link-coverage and represents the number of gold
mines that have at least one fault in a buffer of size A around the mine
such that the fault has at least one volcano in a buffer of size B around this
fault.

6 Conclusion 

In this paper, we have presented a new general approach for mining a new
kind of characteristic rules in a target set of objects. These rules handle both
properties and quantified paths. The latter specify how to take into account
different kinds of objects and their relationships, in other words, how to go
from some objects to others without flattening the tables describing these objects.
We propose CaracteriX, a levelwise algorithm that explores the search space looking
for characteristic rules, taking into account a generality relation between rules.
Moreover, the notions of link-coverage and open-coverage are useful heuristics
to prune the search space. We have tested our approach on a geographic
database and submitted our rules to geologists. They considered that
these rules are interesting and give a good description of a set of chosen target
objects. Quantified paths give a user-friendly way to look for the characteristics of
the target objects according to the spatially linked objects. In the future, we aim
at extending our framework to other kinds of databases, such as object-oriented
databases.


Abstract. This paper describes a semi-supervised algorithm for single class learning
with very few examples. The problem is formulated as a hierarchical latent
variable model which is clipped to ignore classes not of interest. The model is
trained using a multistage EM (msEM) algorithm. The msEM algorithm maximizes
the likelihood of the joint distribution of the data and latent variables, under
the constraint that the distribution of each layer is fixed in successive stages. We
demonstrate that with very few positive examples, the algorithm performs better
than training all layers in a single stage. We also show that the latter is equivalent
to training a single layer model with corresponding parameters. The performance
of the algorithm was verified on several real-world information extraction tasks.



1 Introduction 

Several real world problems fall into the category of single class learning, where training 
data is available for only a single class. Examples of such problems include the identifi- 
cation of a certain class of web-pages - e.g., “personal home pages” or “call for papers” 
[10]. Building training data for such problems can be a particularly arduous task. A 
good sample of the positive class must involve all aspects that can lead to inclusion in 
the positive class. Constructing a negative class would require a uniform representation
of the universal set excluding the positive class [10].

Information extraction is another area where single class learning problems arise 
naturally. Information needs of users are too diverse and numerous to allow the creation 
of significant numbers of labeled examples. Consider, for example, an oil company’s
corporate reputation management group interested in monitoring articles about its own and its
competitors’ image in areas such as workplace diversity, oil spill issues, and environmental
policies. Obtaining comprehensive examples for every one of these topics is almost
impossible. Users are typically willing to provide only very few carefully crafted positive 
examples for each topic of interest. 

The need for single class learning has been recognized and there have been a few 
previous efforts focusing on learning from positive examples. In [7], the algorithm maps 
the data using a kernel and then uses the origin as the negative class. In practice this 
was found to be very sensitive to parametric changes and some heuristic modifications 
were suggested to include more than just the origin into the negative class [4]. Recently 
[10] includes unlabeled examples in an iterative framework that identifies examples not 
sharing features with positive examples, which are then treated as negative examples for 
training a support vector machine. These approaches have concentrated on identifying
negative examples and using them in a discriminative training framework. The motivation
in these approaches has been towards building classifiers that do not degrade in accuracy
with the growth in the size of labeled data [10].

Generative modeling approaches have also been applied to the problem of partially 
labeled data. Unsupervised approaches use joint distributions over the features to iden- 
tify clusters in the data. In particular, finite mixture models trained using the popular 
Expectation-Maximization (EM) algorithm [2] have been used extensively. An interesting
approach in [5] modifies the EM algorithm to allow the incorporation of labeled data.
This approach can, in theory, be used with a small amount of labeled data, and [5] reported
encouraging experiments on multi-class problems where labeled data are available for
each class. A variant of this approach to the single class problem, but with larger amounts 
of labeled data, has been described in [3] with good results. 

In this paper we focus on the single-class learning problems with the following 
two characteristics: (1) The topic of interest only constitutes a very small proportion 
of candidate data, and (2) The topic is specified by very few positive examples (seeds) 
which usually do not represent a fair sample of the topic. For such problems, single 
stage clustering algorithms do not perform well: The precision is low unless the number 
of clusters is large, while the recall is low unless the number of clusters is small. In 
order to overcome this, we use a hierarchical latent variable model trained with a novel 
multistage EM (msEM) algorithm. The algorithm concentrates on the class of interest, 
guided by the labeled examples. Experiments show that the algorithm generalizes well
from a small number of seeds that form skewed samples of the desired topic.

2 Latent Variable Models and Semi-supervised EM Algorithm 

2.1 Latent Variable Model 

One commonly used model for clustering is a mixture model of the form

$$p(z) = \sum_a p(z|a) \, p(a), \tag{1}$$

where the variable a is a latent variable and is interpreted as the class label. Training of this
model involves adjusting the parameters of the probability distributions $p(z|a)$ and $p(a)$.
This model can be trained effectively using the EM algorithm [2].

Given a dataset $\mathbf{z} = \{z_1, z_2, \dots, z_N\}$ of individual observations of z, the EM algorithm
is an iterative algorithm that maximizes the log-likelihood of the model,

$$\sum_i \log p(z_i) = \sum_i \log \sum_a p(z_i, a). \tag{2}$$



2.2 Semi-supervised EM Algorithm 

The EM algorithm for maximizing (2) is an unsupervised algorithm. However, some
prior information is often available. For example, in single class classification, we often have
some seed information: a few labeled examples for the class of interest. Incorporating such seed
constraints into the EM algorithm results in a semi-supervised EM algorithm (ssEM) [5].
We use the version shown in Fig. 1.

- Set initial model parameters for the distributions $p(z|a)$ and $p(a)$.

- Iterate until convergence over the following two steps:

  E-Step: For each $z_i$, compute $q(a|z_i)$ by Bayes rule and seed constraint

  $$\begin{cases} q(a|z_i) = p(a|z_i) = p(z_i, a)/p(z_i), & z_i \notin Seeds \\ q(a = 1|z_i) = 1, & z_i \in Seeds. \end{cases} \tag{3}$$

  M-Step: Estimate new parameters for $p(z|a)$ and $p(a)$ by maximizing

  $$\sum_{i,a} q(a|z_i) \log p(z_i, a). \tag{4}$$

Fig. 1. Semi-supervised EM algorithm (ssEM)
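
To make the two steps concrete, here is a minimal pure-Python sketch of ssEM. It is an illustration under assumptions that are not in the paper: the mixture components are one-dimensional Gaussians, component index 0 plays the role of the class of interest (a = 1 in the text), variances are floored by a hypothetical `min_var` to avoid degenerate components, and the function and parameter names are made up for the example.

```python
import math
from typing import List, Sequence, Set, Tuple


def gaussian(x: float, mean: float, var: float) -> float:
    """Density of a one-dimensional normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def ssem(data: Sequence[float], seeds: Set[int], init_means: Sequence[float],
         n_iter: int = 50, min_var: float = 1e-3
         ) -> Tuple[List[float], List[float], List[float], List[List[float]]]:
    """Semi-supervised EM; `seeds` holds indices of datapoints known to be in component 0."""
    k, n = len(init_means), len(data)
    means = list(init_means)
    priors = [1.0 / k] * k
    variances = [1.0] * k
    q: List[List[float]] = []
    for _ in range(n_iter):
        # E-step: Bayes rule, except that seeds are pinned to component 0.
        q = []
        for i, z in enumerate(data):
            if i in seeds:
                q.append([1.0] + [0.0] * (k - 1))
                continue
            joint = [priors[a] * gaussian(z, means[a], variances[a]) for a in range(k)]
            total = sum(joint)
            q.append([j / total for j in joint] if total > 0.0 else [1.0 / k] * k)
        # M-step: closed-form maximizer of sum_{i,a} q(a|z_i) log p(z_i, a).
        for a in range(k):
            mass = sum(q[i][a] for i in range(n))
            if mass == 0.0:
                continue
            priors[a] = mass / n
            means[a] = sum(q[i][a] * data[i] for i in range(n)) / mass
            var = sum(q[i][a] * (data[i] - means[a]) ** 2 for i in range(n)) / mass
            variances[a] = max(var, min_var)
    return priors, means, variances, q


# Hypothetical toy data: two well-separated groups, one seed in the first group.
data = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
priors, means, variances, q = ssem(data, seeds={0}, init_means=[1.0, 4.0])
print(means)
```

The seed constraint only changes the E-step: the responsibilities of seed points are fixed instead of being computed from the current model.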



However, with very few labeled examples, seed constraints alone are not sufficient 
to tackle the problem at hand. For the task of identifying a single class from multiple 
possibilities, there is a trade-off between the number of components in the mixture model 
(1) and the precision of the chosen class. If the number of components in the mixture 
model is small, the chosen component is likely to contain a large number of spurious 
datapoints. If the number of components is large, the desired class might be fragmented 
among many different components. We now proceed to describe more powerful models 
and algorithms to address this. 

2.3 Hierarchical Latent Variable Models 

A two level hierarchical model can be obtained by replacing the model (1) with

$$p(z) = \sum_{a_0, a_1} p(z|a_0, a_1) \, p(a_1|a_0) \, p(a_0), \tag{5}$$

where $a_0$ and $a_1$ are two levels of latent variables in the hierarchy.

A full-blown model of the form (5) can be expensive to train due to the combinatorial
effect of the hierarchical hidden variables in the E-step. Since we are only interested in
a single class, it is intuitively plausible that clipping off the branches in the hierarchical
model not corresponding to the class of interest would reduce a substantial amount of
computation without large impact on performance. This suggests using the following
“clipped model” (assuming a = 1 corresponds to the class of interest)

$$p(z_i) = \sum_{a_0 = 1} p(a_0) \sum_{a_1} p(a_1|a_0) \, p(z_i|a_0, a_1) + \sum_{a_0 \ne 1} p(a_0) \, p(z_i|a_0). \tag{6}$$

However, as the following two lemmas show, training either of them with ssEM does
not exhibit any advantage over single layer models. The full potential of the clipped model
can be realized by a new training algorithm, which we propose in Section 3.

Lemma 1. The hierarchical model (5) can be represented as a single layer model. They
both behave identically under the standard EM training algorithm.

Proof. The model (5) is a marginal distribution of

$$p(z_i, a_0, a_1) = p(z_i|a_0, a_1) \, p(a_0, a_1). \tag{7}$$

Suppose the combination $(a_0, a_1)$ takes n distinct values. For example, if $a_0$ takes $n_0$
values and $a_1$ takes $n_1$ values, then $n = n_0 n_1$. Introducing a variable c with n distinct
values, the distribution (5) is identical to

$$p(z_i) = \sum_c p(z_i|c) \, p(c). \tag{8}$$

The derivation of the training algorithm involves expressions of $p(z_i|a_0, a_1)$ and
$p(a_0, a_1)$, which can be replaced with expressions of $p(z_i|c)$ and $p(c)$. The resulting
training algorithm is therefore equivalent after simple renaming of parameters. □

Lemma 2. The clipped hierarchical model (6) can be represented as a single layer model.
They both behave identically under the standard EM training algorithm.

Proof. Since $a_1$ does not occur for $a_0 \ne 1$, we can arbitrarily set $a_1 = 1$ for $a_0 \ne 1$.
Then (6) is a marginal of

$$p(z_i, a_0, a_1) = p(z_i|a_0, a_1) \, p(a_0, a_1), \tag{9}$$

where $p(z_i|a_0, a_1) = p(z_i|a_0)$ and $p(a_0, a_1) = p(a_0)$ for $a_0 \ne 1$. This case is analogous
to that of Lemma 1, except that $(a_0, a_1)$ would take values in a discrete set with $2n - 1$
elements. Hereafter the proof follows that of Lemma 1. □

3 Training Clipped Model with Multistage EM Algorithm 

For problems involving a large number of components in the mixture model, the
semi-supervised EM can be successfully applied when labeled data are available for different
classes [5]. However, a more powerful algorithm is needed when only a few labeled
datapoints are available for just one class. We propose a multistage EM algorithm (msEM)
to train the clipped model, which is more suitable for such problems.

3.1 Generalized Form of Likelihood 

The log-likelihood (2) can be written in a more general form [1]

$$N \sum_i q(z_i) \log \sum_a p(z_i, a), \tag{10}$$

where $q(z_i) = 1/N$ is the empirical distribution of the data and N is the size (number
of datapoints) of the dataset. Using (10), the M-step (4) of the EM algorithm can now
be written as maximizing

$$\sum_{i,a} q(z_i) \, q(a|z_i) \log p(z_i, a). \tag{11}$$

The convergence properties of the EM algorithm still hold even when $q(z_i)$ is not the
uniform distribution over the observed data z [1]. The EM algorithm can therefore be
regarded as a mapping $q(z) \to q(z, a)$.

- For each layer m = 0, 1, ..., set initial model parameters for $p_m(z|a)$ and $p_m(a)$.

- Set the first layer empirical distribution $q_0(z_i) = q(z_i) = 1/N$ for all datapoints $z_i$.

- Iterate until convergence:

  • For each layer m = 0, 1, ..., carry out the following three steps

    * E-Step: For each $z_i$, compute $q_m(a_m|z_i)$ by Bayes rule and seed constraint

      $$\begin{cases} q_m(a_m|z_i) = p_m(a_m|z_i) = p_m(z_i, a_m)/p_m(z_i), & z_i \notin Seeds \\ q_m(a_m = 1|z_i) = 1, & z_i \in Seeds. \end{cases} \tag{12}$$

    * M-Step: Estimate new parameters for $p_m(z|a)$ and $p_m(a)$ by maximizing

      $$\sum_{i, a_m} q_m(z_i) \, q_m(a_m|z_i) \log p_m(z_i, a_m). \tag{13}$$

    * Set the empirical distribution for the next layer $q_{m+1}(z_i) = q_m(z_i|a_m = 1)$ for
      all datapoints $z_i$ using Bayes rule

      $$q_m(z_i|a_m) = q_m(z_i) \, q_m(a_m|z_i) / q_m(a_m). \tag{14}$$

Fig. 2. Multistage semi-supervised EM algorithm (msEM)



3.2 Multistage Semi-supervised EM Algorithm 

The msEM algorithm for the clipped model trains each layer successively by
incorporating the empirical distribution from the previous layers (Fig. 2).

Comparing the E and M-steps of the above algorithm, (12) and (13), with those of the
ssEM algorithm, (3) and (11), it can be seen that the computation for layer m implements
the EM algorithm that maximizes the generalized log-likelihood

$$\sum_i q_m(z_i) \log \sum_{a_m} p_m(z_i, a_m), \tag{15}$$

where $q_m(z_i)$ is given by $q_{m-1}(z_i|a_{m-1} = 1)$. The intuition behind this algorithm is that,
by weighting each datapoint with $q_m(z_i)$ instead of the uniform distribution, we are
deemphasizing those $z_i$ that are less likely to be in class 1 as predicted by layer m - 1.
The discrimination in layer m could then conceivably concentrate more on the finer
details not addressed at layer m - 1. Each layer acts as a regularizer to restrict the
variability of the model in the next layer.
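
The reweighting that links one layer to the next is a single application of Bayes rule. The following sketch shows only that step, under the assumption that the current weights $q_m(z_i)$ and the responsibilities $q_m(a_m = 1|z_i)$ are already available as lists; the function name and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def next_layer_weights(weights: Sequence[float], resp_class1: Sequence[float]) -> List[float]:
    """q_{m+1}(z_i) = q_m(z_i) * q_m(a_m = 1 | z_i) / q_m(a_m = 1)."""
    joint = [w * r for w, r in zip(weights, resp_class1)]
    marginal = sum(joint)                      # q_m(a_m = 1)
    if marginal == 0.0:
        raise ValueError("no probability mass assigned to the class of interest")
    return [j / marginal for j in joint]


# Four datapoints with uniform first-layer weights and made-up responsibilities.
q0 = [0.25, 0.25, 0.25, 0.25]
resp = [1.0, 0.8, 0.2, 0.0]
q1 = next_layer_weights(q0, resp)
print(q1)
```

Datapoints that the current layer considers unlikely to belong to the class of interest receive little or no weight when the next layer is trained.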

3.3 Deriving the Updated Empirical Distribution 

In the msEM algorithm, the empirical distribution $q_m$ of layer m is computed from
the results of layer m - 1. The rule for computing $q_m$ can be derived from a global
optimization problem involving layers 0 through m. The objective function for layer m is

$$\sum_{i, a_0, \dots, a_{m-1}} q(z_i, a_0, \dots, a_{m-1}) \log \sum_{a_m} p(z_i, a_0, \dots, a_m). \tag{16}$$

Maximizing (16) for successive m with ssEM implies that layer m is trained under the
constraint that $q(z_i, a_0, \dots, a_{m-1})$ is fixed. We now show that this is indeed equivalent
to the msEM given in Fig. 2.

For layer 0, the objective function (16) reduces to

$$\sum_i q(z_i) \log \sum_{a_0} p(z_i, a_0). \tag{17}$$

The ssEM algorithm for maximizing (17) is the same as the msEM algorithm for layer 0,
with the following substitutions,

$$q(z_i, a_0) = q_0(z_i, a_0), \quad p(z_i, a_0) = p_0(z_i, a_0). \tag{18}$$

For layer 1, the objective function (16) reduces to

$$\sum_{i, a_0} q(z_i, a_0) \log \sum_{a_1} p(z_i, a_0, a_1). \tag{19}$$

The E- and M-steps corresponding to (3) and (11) are therefore

$$\begin{cases} q(a_1|a_0, z_i) = p(a_1|a_0, z_i) = p(z_i, a_1|a_0)/p(z_i|a_0), & z_i \notin Seeds \\ q(a_1 = 1|a_0, z_i) = 1, & z_i \in Seeds, \end{cases} \tag{20}$$

$$\text{maximize} \sum_{z_i, a_0, a_1} q(z_i, a_0) \, q(a_1|z_i, a_0) \log p(z_i, a_0, a_1). \tag{21}$$

Since the model is clipped, it is clear that the E-step (20) is equivalent to (12) with the
following substitutions.

$$p_1(a_1, z_i) = p(a_1, z_i|a_0 = 1), \quad q_1(a_1, z_i) = q(a_1, z_i|a_0 = 1). \tag{22}$$

The M-step objective function in (21) can be expanded into three terms.

$$\sum_{a_0} q(a_0) \log p(a_0) + \sum_{a_0 \ne 1} \sum_{z_i} q(z_i, a_0) \log p(z_i|a_0) + \sum_{a_0 = 1} \sum_{z_i, a_1} q(a_0) \, q(z_i|a_0) \, q(a_1|z_i, a_0) \log p(z_i, a_1|a_0). \tag{23}$$

Maximizing the first two terms leads to $p(z_i, a_0)$ already calculated in (18). Using the
definitions of $p_1$ and $q_1$ in (22) and the fact that the third term only involves $a_0 = 1$, it
can be verified that maximizing the third term is equivalent to maximizing (13), provided
that the empirical distribution is given as $q_1(z_i) = q_0(z_i|a_0 = 1)$. Therefore we have
proved the equivalence for layer 1.

Similarly it can be shown that for any m, the E- and M-steps in the msEM are
equivalent to the steps in an ssEM that maximizes (16), provided that $q_{m+1}(z_i) =
q_m(z_i|a_m = 1)$.

3.4 Considerations on the Convergence of Multistage EM

In our experiments the msEM always converged with speed similar to ssEM. Here
we outline an approach to prove the convergence. It is known that each iteration of
the standard EM algorithm increases the likelihood, which converges to a stationary
point [2]. Under certain conditions, this also results in the convergence of the probability
distributions p and q to fixed points [8]. Under certain stronger conditions, the mapping
$q(z) \to q(z, a)$ is continuous. Assuming that each layer of the clipped model satisfies
all these conditions, the convergence of msEM can be proved by induction. Layer 0
implements ssEM so $q_0(z, a)$ converges. Suppose $q_{m-1}(z, a)$ converges. Then $q_m(z) =
q_{m-1}(z|a = 1)$ also converges. The continuity of the mapping $q_m(z) \to q_m(z, a)$ then
implies the convergence of $q_m(z, a)$.

3.5 Multistage EM Interpreted as Reverse Boosting 

We note briefly the relationship between the msEM algorithm and boosted density esti- 
mation [6]. An extensive description of this relationship is available from [11]. 

Letao:m = (oo, • ■ • , Om) and denote by ao:m = 1 the condition oq = • • • = Om-i = 

1. The msEM algorithm can be regarded as building successively more complex models 
with weighted weak learners. At stage m the model built so far is 

Pm{z) = ^ p{ao)p{z\ao) 

ao^l 

+p{ao = 1) ^ p{ai\ao = l)p{z\ao = l,ai) 

Oi/l 



m—1 

+ n ^ I|a0:/-1 = ^)'^p{am\a0:m-l = I)p(^|a0:m-1 = l,am)- 

Z— 0 a-m 

(24) 

The msEM algorithm attempts to improve the classification of a single class by getting 
successively better density estimates for that single class subject to the seed constraints. 
The weak learner chosen at layer m is a finite mixture model Pm{z, Um). trained with 
ssEM. In contrast to the boosted density estimation [6], which emphasizes regions with 



1. Set the initial weights Wi = 1/N. 

2. for m=l to M 

(a) Use EM algorithm to compute Pm{zi\am) and Pm{o-m) that maximizes 

Wi logJjjj J3m(2i|am)Pm(am) subject to Seed constraints, obtaining 
qm{am\zi) in the process. 

(b) Set Wi = ?m(2i|«m = 1), Calculated with Bayes rule (14). 

3. Output final model Pm 



Fig. 3. Multistage EM interpreted as reverse boosting 
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Table 1. Experimental datasets and topics 



Dataset 


# Entities 


# Datapoints 


# Features 


Half-window 


Topics 


# Seeds 


TN 


2151 


87,251 


24,808 


400 characters 


“chip”, “web” 


3 


OP 


10 


30,000 


77,762 


75 words 


“diversity” 


2-8 



Intel has not reduced its capital spending budget o/$7.5 billion for the year, in part to 
accommodate the introduction of 300-millimeter wafer production. Chips produced on 
the new wafers will also be made with the more advanced 0.13-micron manufacturing 
process and contain copper wires. Intel currently makes its chips with the 0.18-micron 
manufacturing process and uses aluminum. The micron measurements refer to the size of 
features on the chip. The shift will result in smaller, cooler, faster and cheaper processors. 
"Intel expects chips produced on 300-millimeter wafers to cost 30 percent less than those 
made using the smaller wafers, " Tom Garrett, Intel’s 300-millimeter program manager, 
said in a statement. 



Fig. 4. Example passage for the chip manufacturing topic 



high uncertainty, the msEM emphasizes regions with high certainty of being in the class
of interest, by increasing the weight of datapoints that perform well in the previous layer.
The msEM algorithm in the boosting framework is shown in Fig. 3, where the weight $w_i$
corresponds to $q_{m-1}(z_i)$ in the msEM algorithm.

4 Experimental Setting 

4.1 Data Sets 

We experimented with msEM and several existing algorithms on two real-world document
collections. The datasets are formed from passages in documents crawled from the
Web. For each document, a set of proper names is identified as being of interest (named
entities). A passage of a fixed window size surrounding each named entity is taken as its
context. An example of such a passage is shown in Fig. 4. The context is tokenized into
words (discarding punctuation), stop words (a list of 232 common words) are removed,
and the remaining words are stemmed (using Porter’s stemmer), resulting in a vector of
feature counts. The named entity and the context feature counts together form a
datapoint $z_i$. Note that each document can provide multiple datapoints.

Two test collections were made (Table 1). The collection TN was gathered from
the Tech News section of CNet (www.cnet.com). A named entity tagger was used to
identify organizational names (such as Intel, IBM, Microsoft, etc.). The collection OP
consists of web pages discussing oil companies. A list of organizational names (each
containing variations of the same organizational name) was obtained from industry
experts. Some of these datasets will soon be made publicly available for other researchers
to test their algorithms.

4.2 Topics and Seeds

The experiments were conducted on the following three topics: 

chip       Steps taken by semiconductor manufacturers to produce cheaper, faster and
           thermally more efficient microprocessors and microchips. (Dataset TN)
web        Web Service protocols for business process integration. (Dataset TN)
diversity  Issues related to diversity in the workplace. (Dataset OP)

For the “chip” topic, three seeds occurring in two documents were identified from the
corpus as relevant to the query. For the “web” topic, three seeds were selected from three
documents. The “diversity” topic was used extensively to understand the
behavior of the different algorithms. A series of experiments with several seed sets of
different characteristics was conducted, as described below.

4.3 Algorithms, Parameters, and Evaluations 

We compare msEM against several alternative algorithms: the
semi-supervised EM algorithm on single-layer latent variable models (ssEM) [5], a simple
nearest neighbor algorithm (NN) [9] and a proximity pattern search (PPS).

For the ssEM algorithm, we used $p(z \mid a) = p(x \mid a)\, p(y \mid a)$, where x and y are the named
entity and the context feature count vector, respectively, $p(x \mid a)$ is a discrete distribution,
and $p(y \mid a)$ is the multinomial distribution. Laplace smoothing is used in the M-step [5].
Let k be the number of components in the mixture model. The parameters of $p(z \mid a)$
are initialized by assigning $q(a = j \mid z_i) = 1/k + \epsilon$, $j = 1, \ldots, k$, for non-seed $z_i$ and
$q(a = 1 \mid z_i) = 1$ for seed $z_i$, followed by an M-step, where $\epsilon$ is a small noise term used for
breaking symmetry among the components.

Each layer in the msEM algorithm uses the same model as in the ssEM algorithm
with exactly two components (k = 2). The initialization of each layer is also the same as
above, except that no symmetry breaking is necessary.

The NN algorithm is performed on the same tokenized context (without the named 
entity) using cosine similarity on the feature count vectors. A datapoint is deemed on 
topic if the similarity is higher than a cutoff value. 
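
The NN baseline described above can be illustrated with a minimal sketch. It assumes that each context is a sparse bag of feature counts and that a datapoint is compared with its most similar seed; the paper does not spell out the reference point for the similarity, so the `max` over seeds, the function names and the toy vectors are all illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse feature-count vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nn_on_topic(context, seeds, cutoff):
    """Assumed decision rule: the best seed similarity must exceed the cutoff."""
    return max(cosine(context, seed) for seed in seeds) > cutoff


# Toy feature counts (hypothetical, not taken from the TN or OP collections).
seeds = [Counter({"chip": 2, "wafer": 1})]
on_topic = Counter({"chip": 1, "wafer": 1, "intel": 1})
off_topic = Counter({"oil": 3, "diversity": 1})

print(nn_on_topic(on_topic, seeds, cutoff=0.5))
print(nn_on_topic(off_topic, seeds, cutoff=0.5))
```

Raising the cutoff trades recall for precision, which is the role this parameter plays in the comparison.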

For the topics “chip” and “web” we also tested a type of proximity pattern search
(PPS), based on patterns given by domain experts. A datapoint is deemed on topic if
the pattern matches within a distance of the named entity in the original (not tokenized)
context.

Each of these algorithms contains a parameter that controls the trade-off between 
precision and recall: number of layers for msEM, number of nodes for ssEM, similar- 
ity cutoff for NN and the distance for PPS. Results for different values of the control 
parameters are shown in the next section. 

Results of the algorithms were manually evaluated by domain experts to establish a 
set of ground truth. Since the datasets are too large to be evaluated completely, only results 
from the high-precision versions of each algorithm are pooled together and evaluated. 
This allows us to calculate the precision of these algorithms and the number of correctly 
retrieved datapoints, but not the actual recall (which requires knowing the total number 
of on-topic datapoints in the whole corpus). 

Fig. 5. Precision vs number of correct datapoints

5 Results 

Our experiments with ssEM [5] consistently returned very poor results. The algorithm
was run with several different numbers of components ranging from 20 to 100, and the
results were significantly worse than those of the other algorithms. We therefore do not
report the actual numbers in this paper.

5.1 Results on the TN Collection 

The results obtained by the algorithms with different parameter settings on the TN dataset
are shown in Fig. 5(a) and 5(b). A glance at these figures indicates that msEM
is a clear winner for topic “chip”, outperforming the other algorithms significantly. For
topic “web” the NN algorithm is very competitive, and the results drop off at 0.35. A
look at the results of the pattern-matching proximity search helps shed more light on the
relative performance of msEM and nearest neighbor. For topic “chip” the best
performance for proximity search is a precision of 0.4, which drops to a low of 0.3 with
increasing recall, while for topic “web” the best performance is about 0.9, dropping to
a low of just below 0.45. This seems to suggest that topic “web” is defined by simpler
patterns than topic “chip”. Closer examination of the results and discussion with the
domain expert revealed that the “web” topic was particularly narrow and can be effectively
defined by spotting a few keywords. In effect, the smoothing/generalization
provided by the msEM algorithm did not provide any advantage here and instead worsened
the results.

5.2 Effects of Seeds 

To better understand the effect of seeds on the two algorithms msEM and NN, we
conducted several experiments on the “diversity” topic using the dataset OP. This topic
consists of several subtopics such as issues concerning minorities, domestic partner benefits,
gender equality, etc. The complexity of the task is increased by the fact that these
subtopics share very few words in common. We identified two types of seeds:

- General seeds: passages discussing workplace diversity policies in general.

- Specific seeds: passages discussing a specific instance of a company changing its
policy on domestic partner benefits.

Fig. 6. Precision vs number of correct datapoints for Topic “diversity”

The results of NN and msEM are shown in Fig. 6. Several observations can be made.
The NN is almost always better at highest precision. The msEM is almost always better
at generalization. With general seeds, NN and msEM perform comparably. With
specific seeds, the NN algorithm almost exclusively retrieves datapoints that deal with
the same specific instance as the seeds (confirmed by examining the actual retrieved
passages). In contrast, at intermediate precision, the msEM can generalize significantly
better than NN. The better performance of the NN in the range of very high precision
and low recall is due to its retrieving only datapoints very similar to the seeds. In this
range the generalization ability of the msEM is not particularly useful. On the other hand, the
NN fails to generalize for the specific seeds, which form a skewed sample of the topic.
The msEM is able to generalize better because its retrieval set is not entirely defined by
similarity to seeds; the clustering of the unlabeled data also plays an important role.
Abstract. Decision tree methods generally suppose that the number of
categories of the attribute to be predicted is fixed. Breiman et al., with
their Twoing criterion in CART, considered gathering the categories of
the predicted attribute into two superclasses. In this paper, we propose
an extension of this method. We try to merge the categories into an optimal
unspecified number of superclasses. Our method, called Arbogodai, allows
grouping, during tree growing, categories of the target variable as well as
categories of the predictive attributes. At the end, the user can choose to
generate either a set of single rules or a set of multi-conclusion rules
that provide interval-like predictions.

1 Introduction 

Induction trees are among the most popular supervised methods proposed in 
the literature. They are appreciated for the simplicity and the high efficacy of 
the algorithms, for their ease of use and for the easily interpretable results they 
provide. Hastie et al. [3], p. 313, designate them as the learning tool that comes 
closest to the requirements of an “off-the-shelf” method. 

Many induction tree methods have been proposed so far in the literature.
Some, like ID3 [6], C4.5 [7] and CHAID [4,5], build n-ary trees; others, like
CART [2], produce binary trees or, like SIPINA and Branching Programs [11,12],
latticed graphs that generalize trees by allowing the merging of nodes.

All these methods were originally intended for categorical attributes and therefore
require that quantitative variables be discretized. This discretization can
be done at once before growing the tree. Most of the tree growing methods,
nevertheless, handle quantitative variables in an automatic manner by dynamically
choosing the optimal discretization thresholds at each node. Some methods also
attempt to reduce the number of categories of nominal attributes by partitioning
them into a smaller number of classes. CART, for example, merges the categories
into two new superclasses at each new split. This has the advantage of avoiding
a useless increase in the number of nodes. Indeed, the higher the number of
nodes, the greater the chances that some of them will have too few cases to
get reliable estimates of the response class probabilities.

There are two main ways of partitioning the values of the predictive
attributes. The first is, for instance, a characteristic feature of CHAID [4]. At each
node, the local discriminating power of each categorical attribute is tested using
all possible partitions of its values. Partitions in two or more groups are explored.
Thus, for each split, a predictor is selected simultaneously with its locally best
partition.

The second strategy is used, for instance, by Breiman et al. [2] in their CART
method. At each node, CART looks only for the best bi-partition of each predictor.
It thus generates only binary trees. With their Twoing criterion, the authors
of CART also propose, however, a strategy that extends their principle to the
response variable. When the response is multi-valued, using Twoing is equivalent
to seeking, for every predictor, simultaneously the best bi-partition of its values and
the best bi-partition of the response values. The Twoing is the value of the Gini
impurity for the best couple of bi-partitions and is used for selecting the split
variable at each node.

In this paper, we extend the principle of a simultaneous search of a double
bi-partition. We combine the CHAID and CART approaches. Like CHAID, we look
at each step for the best, not necessarily binary, partition of the attributes. Like
CART with Twoing, we also explore the partitioning of the values of the target
variable. Unlike CART, we do not, however, restrict ourselves to bi-partitions. At
each step we look for the simultaneous grouping of the predictor values and of
the target variable values that optimizes the chosen criterion. We make use here
of results given in [9,10]. This gives rise to a new induction tree method that we
call Arbogodai. This kind of tree is characterized by a number of value classes
of the target variable that varies from one node to the other. It is dynamically
determined at each new split. When the majority class in a leaf contains several
response values, the corresponding prediction rule becomes a multiple conclusion
rule. For instance, we can get a rule like “a female customer aged between 30 and
40 with a monthly income ranging from 4000 to 5000 euros will choose a red or
blue car”. Indeed, we can easily compute which of the two colors is more frequent
in the leaf. Hence, we can also derive classical simple rules. With Arbogodai, the
user can choose the kind of rule that best suits her needs.

The paper is organized as follows. Section 2 introduces the notations and recalls
the basic induction tree concepts. Section 3 discusses the optimal reduction
of the crosstable that crosses at each node the target variable with the predictive
attribute. Then Section 4 introduces the Arbogodai algorithm that looks for
such an optimal reduction when testing the attributes at each new node. In
Section 5, we specify the multiple conclusion nature of the induced rules and propose
adapted error rates for Arbogodai trees. We also report experiments that
attempt to compare the generalization performance of Arbogodai with other
tree algorithms. Finally, in Section 6 we make some concluding remarks.



2 Principle of Induction Trees and Notations 

Let $\Omega$ be the population concerned by the learning problem. The profile of any
member $\omega$ of $\Omega$ is described by p variables, $X_1, \ldots, X_p$, called either exogenous
variables, predictive attributes or predictors. These variables can be qualitative
or quantitative. The set of values taken by $X_j$ is denoted by $\mathcal{X}_j$. We consider
also a target attribute C, sometimes called response, endogenous or dependent
variable, and designate by $\mathcal{C}$ the set of response values. Like the $X_j$'s, C can be
qualitative or quantitative. Since the attributes $X_j$ and the target variable C
take only a finite number of different values in a given dataset, the sets $\mathcal{X}_j$ and $\mathcal{C}$
are finite. We denote by $m_j$ the number of different values taken by the attribute
$X_j$ and by $\ell$ the number of different response values $c_i$. Thus, $\mathcal{C} = \{c_1, \ldots, c_\ell\}$.
The goal of induction trees is then to generate a model $\phi(X_1, \ldots, X_p)$ in the
form of a decision tree for predicting the value of C from the knowledge of the
values taken by the predictive attributes. The tree $\phi$ is induced from a training
sample $\Omega_l \subset \Omega$.

The growing process of the tree is quite simple. As illustrated in Figure 1,
the set $\Omega_l$ is iteratively split by means of, at each step, one of the predictive
attributes $X_1, \ldots, X_p$. The goal is to get distributions among the values of the
target variable that are as different as possible. The leaves of the trees obtained at
each step t of the growing process define a partition $S_t$ of $\Omega_l$ that becomes finer
and finer with t. The root of the tree corresponds to the trivial partition $S_0 = \{\Omega_l\}$.
The tree given in Figure 1 partitions $\Omega_l$ into three subsets corresponding
to the nodes $s_2$, $s_3$ and $s_4$. In leaf $s_3$ for example, we have the set of cases of $\Omega_l$
that take values $X_1$ = male and $X_2 < 5000$. At step t, the partition $S_t$ is derived
from the previous one $S_{t-1}$ by seeking the best leaf-attribute couple $(s_k, X_j)$, i.e.
that for which the splitting of $s_k \in S_{t-1}$ according to the values of $X_j$ maximizes
the gain of information on the target variable between $S_{t-1}$ and $S_t$. The gain of
information is usually measured as the reduction in uncertainty for the target
variable or as the increase in the strength of association between the partition
and the target variable. The growing process stops when the criterion can no
longer be improved or when some stopping criterion is reached.

Let n be the grand total of cases, $n_{ik}$ the number of cases with value $c_i$
for the target variable in the class (leaf) $s_k$ of the partition S, $n_{\cdot k}$ the total
number of cases in leaf $s_k$, and $n_{i\cdot}$ the total number of cases with value $c_i$. The
corresponding observed frequencies are denoted respectively by $f_{ik}$, $f_{\cdot k}$ and $f_{i\cdot}$,
and $f_{i|k} = n_{ik}/n_{\cdot k}$ stands for the conditional frequency of value $c_i$ in the leaf $s_k$.

Fig. 1. An induced tree

Table 1. Contingency table defined by $X_j$ at a node s

At any node s of a tree, an attribute $X_j$ defines a partition of the cases
in s. This partition is described by the columns of the $\ell \times m_j$ contingency table
(Table 1) that crosses the target variable (rows) with $X_j$ (columns).

The criteria used to measure the gain of information brought by a split
defined by $X_j$ are computed from this table. For instance, some methods try to
maximize the reduction in uncertainty as measured by entropies. In this case,
the uncertainty after the split is defined as the weighted mean of the uncertainty
of the columns of the contingency Table 1

\[
h(C \mid X_j) = \sum_{k=1}^{m_j} f_{\cdot k}\, h(C \mid k) \tag{1}
\]

where $h(\cdot)$ is, for example, the Shannon entropy, $h(C \mid k) = -\sum_{i=1}^{\ell} f_{i|k} \log_2 f_{i|k}$, or the
quadratic entropy, also known as the Gini diversity index, $h(C \mid k) = \sum_{i=1}^{\ell} f_{i|k}(1 - f_{i|k})$.

Alternatively, some methods, like CHAID, optimize the strength or the statistical
significance of the association between the resulting partition (columns of
Table 1) and the target variable (rows of Table 1).

3 Optimal Reduction of a Contingency Table 

Let us recall that CHAID tries, at each step, to merge the columns of crosstables
like Table 1 to find the best grouping of values for each candidate attribute, i.e.
the grouping that optimizes the criterion. CHAID makes no change, however,
to the values of the target variable. Arbogodai, like the Twoing approach in
CART, considers merging both columns and rows. Unlike the Twoing rule, which
looks for the best solution among 2 × 2 tables only, we seek the best
cross-partition without constraints on the number of rows and columns. This
section discusses this issue. First, we motivate the approach by showing that
the best n-ary split can sometimes be missed by successive binary splits. Then,
we examine strategies for determining the best simultaneous partition of the
rows and the columns. We show that the search for the optimal solution rapidly
becomes intractable when the number of values of the predictive and/or target
variable exceeds 5 or 6. This leads us to use a heuristic to get a quasi-optimal
solution.

3.1 Motivation of n-ary Partitions

Consider the crosstable of Table 2. The best bi-partition of its columns is $S_{bin}$ =
{{a,b},{d,e}}, whether we use the Gini, the Twoing, the significance of Pearson’s
Chi-square or an association measure like the t of Tschuprow. Now, the
best 3-way partition is $S_{3way}$ = {{a}, {b, d}, {e}} with any of the criteria except
Twoing, which is not applicable. Clearly $S_{3way}$ cannot be obtained by splitting
the classes of $S_{bin}$. This proves that multiple binary partitions are not equivalent
to n-ary partitions and can sometimes miss optimal solutions.

The merging of response values is different in nature from that of predictive
attributes. Indeed, the partition of the response values does not translate into a
split of the node. Considering such mergings in the optimization process therefore
merits some further justification. This is given by simply extending the argument
of Breiman et al. ([2], p. 105), who argue that searching for superclasses
(the groups of the partitions of the response values) provides strategic information
on the similarities of responses. When two or more responses, red car and
blue car for example, are almost equally frequent, it may be a better strategy to
predict that the customer will buy a red or a blue car than explicitly a red one.
Simultaneously, it may be useful to know that yellow and pink are considerably
more probable than all the other non-red and non-blue proposed colors. There is
thus no reason to limit the argument to two superclasses only. Multiple superclasses
provide more refined strategic information.

3.2 Optimal Reduction 

Let $T_s(C, X_j)$ denote the contingency table obtained by crossing the target
variable C with the predictive attribute $X_j$ at a given node s of the tree. From
here on we shall drop the subscripts j and s when there is no ambiguity. Our
objective is to seek the couple of partitions that produces the table T(C, X) with
maximal row-column association $\theta$. We write the association criterion as $\theta(C, X)$
to make it clear that it varies with the partitioning of the values of C and X.
Let us recall that $\mathcal{X}$ stands for the set of distinct values of X in the concerned
population, and $\mathcal{C}$ for the set of distinct values of the target variable C.

We have to distinguish between ordered and unordered sets $\mathcal{X}$ and $\mathcal{C}$. In the
unordered case, i.e. for nominal attributes, we denote respectively by $\mathcal{P}_X$ and $\mathcal{P}_C$
the sets of possible partitions of $\mathcal{X}$ and $\mathcal{C}$. In the ordered case, i.e. for ordinal or
quantitative attributes, only the merging of adjacent categories is allowed. We
denote by $\mathcal{A}_X$ and $\mathcal{A}_C$ the sets of such allowed partitions.



Table 2. An n-ary solution different from that of successive binary splits

For the unordered case, assuming the criterion has to be maximized, we seek
the best couple of partitions in $\mathcal{P}_C \times \mathcal{P}_X$. We have to solve $\arg\max \theta(C, X)$ for
$(C, X) \in \mathcal{P}_C \times \mathcal{P}_X$. Finding the optimal solution thus requires scanning $|\mathcal{P}_C|\,|\mathcal{P}_X|$
crosstables. For ordered variables, we replace $\mathcal{P}$ with $\mathcal{A}$.

The number of partitions $|\mathcal{P}|$ of a set of size m is given by Bell’s formula,
$B(m) = \sum_{k=0}^{m-1} \binom{m-1}{k} B(k)$ with $B(0) = 1$. In the ordinal case the number of
partitions $|\mathcal{A}|$ is simply $\sum_{k=0}^{m-1} \binom{m-1}{k} = 2^{m-1}$. Thus, when both variables are
nominal with $m = \ell = 10$ categories, we would have to test more than 13
billion tables. This is indeed intractable.
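
The size of this search space can be checked with a short sketch that evaluates the Bell recurrence and the count of partitions into adjacent classes. The function names are illustrative and not part of the paper.

```python
from math import comb


def bell(m):
    """Number of partitions of a set of size m, via the Bell recurrence."""
    values = [1]  # B(0) = 1
    for size in range(1, m + 1):
        values.append(sum(comb(size - 1, k) * values[k] for k in range(size)))
    return values[m]


def adjacent_partitions(m):
    """Number of partitions of m ordered categories into adjacent classes."""
    return 2 ** (m - 1) if m > 0 else 1


m = ell = 10
print(bell(m) * bell(ell))                                # nominal x nominal
print(adjacent_partitions(m) * adjacent_partitions(ell))  # ordinal x ordinal
```

With ten categories on each side, the nominal case yields 115,975 × 115,975 candidate tables, whereas the ordinal case yields only 512 × 512.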

3.3 Reduction Heuristic 

To face the limitations mentioned above, two approaches can be considered. The first is
to seek the optimal solution among reasonably sized tables only, i.e. by limiting
partitions to a maximum of, say, 5 or 6 classes. The second approach consists
in using a heuristic that would provide a solution close to the true optimum.
We have considered this last case in our work on the optimal aggregation of
contingency tables [10], in which we studied an algorithm first introduced in [9].
Our experience with the heuristic led us to two main conclusions. First, the
heuristic provides solutions that are most of the time very close to the true
optimum. Secondly, the optimal solution is very unstable. It depends indeed
strongly on the sample considered. In a learning framework, where the learned
rules are intended to be applied outside the learning sample, it is then not crucial
to know the exact learning optimum. A solution close to the optimum is largely
sufficient. We decided therefore to adopt here the heuristic studied in [10], which
we recall briefly.

The heuristic is a simple greedy algorithm. It iteratively merges two rows or 
two columns. At each step, it merges the couple of either rows or columns that 
provides the greatest improvement in the criterion. The algorithm reduces thus 
at each step the table by one row or one column. The process stops when any 
additional merging would deteriorate the criterion. 
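
The greedy merging just described can be sketched as follows for the nominal case. This is an illustrative sketch and not the authors' implementation: it uses Tschuprow's t as the association criterion (the measure the paper adopts in Section 4.1), represents the table as a list of rows, and stops as soon as no merge strictly improves the criterion. The strict-improvement rule, the function names and the toy table are assumptions.

```python
from itertools import combinations
from math import sqrt


def tschuprow_t(table):
    """Tschuprow's t of a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    n_rows, n_cols = len(table), len(table[0])
    if n == 0 or n_rows < 2 or n_cols < 2:
        return 0.0  # degenerate table: no association can be measured
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(n_rows):
        for k in range(n_cols):
            expected = row_tot[i] * col_tot[k] / n
            if expected > 0:
                chi2 += (table[i][k] - expected) ** 2 / expected
    return sqrt(chi2 / (n * sqrt((n_rows - 1) * (n_cols - 1))))


def merge(items, i, j, combine):
    """Drop items i and j and append their combination at the end."""
    kept = [x for idx, x in enumerate(items) if idx not in (i, j)]
    return kept + [combine(items[i], items[j])]


def add_vectors(a, b):
    return [x + y for x, y in zip(a, b)]


def transpose(table):
    return [list(col) for col in zip(*table)]


def greedy_reduce(table, criterion=tschuprow_t):
    """Merge two rows or two columns at a time while the criterion improves."""
    table = [list(row) for row in table]
    row_groups = [[i] for i in range(len(table))]
    col_groups = [[k] for k in range(len(table[0]))]
    best = criterion(table)
    while True:
        candidates = []
        for i, j in combinations(range(len(table)), 2):
            candidates.append((criterion(merge(table, i, j, add_vectors)), "row", i, j))
        columns = transpose(table)
        for i, j in combinations(range(len(columns)), 2):
            merged = transpose(merge(columns, i, j, add_vectors))
            candidates.append((criterion(merged), "col", i, j))
        if not candidates:
            break
        score, axis, i, j = max(candidates, key=lambda c: c[0])
        if score <= best:  # assumed stopping rule: require strict improvement
            break
        best = score
        if axis == "row":
            table = merge(table, i, j, add_vectors)
            row_groups = merge(row_groups, i, j, lambda a, b: a + b)
        else:
            table = transpose(merge(transpose(table), i, j, add_vectors))
            col_groups = merge(col_groups, i, j, lambda a, b: a + b)
    return table, row_groups, col_groups, best


# Toy table (hypothetical): columns 0 and 1 have identical distributions.
reduced, rows, cols, score = greedy_reduce([[10, 10, 0], [0, 0, 20]])
print(reduced, rows, cols, score)
```

For ordered variables, the candidate pairs would be restricted to adjacent rows or columns; that variant is not shown here.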

Let $C^k$ and $X^k$ be the partitions of the values of C and X after step k. For
C for example, we denote by $\mathcal{P}_C^{k}$ the set of partitions that can be obtained in
the nominal case by grouping two classes of $C^k$. When $\mathcal{C}$ is an ordered set,
we consider the set $\mathcal{A}_C^{k}$ of partitions that can be obtained by grouping adjacent
classes of $C^k$.

Assuming the criterion $\theta$ has to be maximized, the row-column configuration
$(C^k, X^k)$ achieved after step k is, for the nominal case, the solution of

\[
\begin{cases}
\arg\max \theta(C^k, X^k) \\
\text{s.t. } C^k = C^{k-1} \text{ and } X^k \in \mathcal{P}_X^{k-1} \\
\text{or } C^k \in \mathcal{P}_C^{k-1} \text{ and } X^k = X^{k-1}
\end{cases}
\tag{2}
\]

For ordinal variables, we replace $\mathcal{P}^{k-1}$ by $\mathcal{A}^{k-1}$.

The algorithm starts from the finest partitions $X^0$ and $C^0$ of the values
observed at the concerned node s. It seeks iteratively the tables $T_k(C, X)$
(k = 1, 2, ...) corresponding to the double partition solution of problem (2).
The procedure is repeated as long as we have $\theta(C^k, X^k) > \theta(C^{k-1}, X^{k-1})$.
4 Arbogodai Trees 

We first explain the principle of the Arbogodai algorithm and, then, describe how 
it works on an example. 

4.1 Principle of the Algorithm 

Arbogodai follows the general principle of tree growing presented in Section 2. 
Its specificity is an additional preparatory step before testing the attributes at a 
node. This step consists in optimally reducing the size of the table that crosses 
the target variable with every attribute. The splitting criterion is then computed 
using the found partitions of both the attribute and the target variable values. 
The splitting of the selected node is done according to the found classes of values 
of the selected predictive attribute. 

This additional step plays a role similar to discretization. The merging of 
values can indeed be assimilated to some sort of discretization that works also 
on nominal variables. Remember, however, that the merging is done here simul- 
taneously at each step on the target and the predictive attribute. 

The reduction of the table is that for which the row-column association $\theta$
is maximized. We use the heuristic of Section 3.3 and measure the association
$\theta$ with the t of Tschuprow: $t = \sqrt{\chi^2 / \big(n \sqrt{(\ell - 1)(m - 1)}\big)}$, where
$\chi^2 = \sum_{i=1}^{\ell} \sum_{k=1}^{m} (n\, n_{ik} - n_{i\cdot} n_{\cdot k})^2 / (n\, n_{i\cdot} n_{\cdot k})$ is the Pearson Chi-square
statistic. Unlike some other association measures, the t of Tschuprow may increase
with the merging of either rows or columns (see [10]).

The splitting criterion is the reduction in uncertainty (gain in purity) achieved
with the columns of the reduced table as compared to its margin. The uncertainty
after the split is computed for every $X_j$ by applying formula (1) on the
optimal reduced table for $X_j$ at the considered node s. Using the * to denote
quantities derived from the reduced table, the gain in uncertainty reads, with
the quadratic (Gini) entropy:

\[
h(C^*) - h(C^* \mid X^*) = \sum_i \Big[ \Big( \sum_k f^*_{\cdot k}\, f^{*2}_{i|k} \Big) - f^{*2}_{i\cdot} \Big] \tag{3}
\]

In addition, we use Laplace estimates for the proportions, i.e. the $f^*$’s are
computed by adding a constant $\lambda$ to each cell of the reduced table and of its margins.
This penalizes the gain of uncertainty obtained at nodes with small counts. With
very small counts, i.e. when $\lambda$ represents a significant proportion of the count, a
split may even deteriorate the uncertainty criterion (see [12], p. 76).

4.2 Example 

We now describe the Arbogodai algorithm through an example. We consider
the Flags dataset from the UCI repository [1]. The response variable C takes 6
nominal values, $\mathcal{C} = \{c_1, c_2, c_3, c_4, c_5, c_6\}$, and there are 29 mixed categorical
and quantitative predictive attributes $X_1, \ldots, X_{29}$. The dataset contains 194
cases. Figure 2 shows an extract of the first two levels of the Arbogodai tree for
these data.

Table 3. Step 1 optimal crosstable and Laplace estimates of column distributions

C* / X7*             {c,d,e,h}   {a,b,g}   {f}
{c1}                 33          6         0
{c2, c4, c5, c6}     2           100       1
{c3}                 17          9         26
Total                52          115       27

Step 1. At the root of the tree, we have the distribution of all 194 cases among
the 6 values of the response C. The 29 predictive attributes are successively
tested. For every attribute, we first determine the optimal reduced crosstable
with the target variable. We then select the attribute for which the gain in
uncertainty computed on the reduced table is maximal. The winner is $X_7$, which
takes 8 values: $\mathcal{X}_7 = \{a, b, c, d, e, f, g, h\}$. The two simultaneous groupings found
by the heuristic of Section 3.3 are $X_7^* = \{\{c,d,e,h\}; \{a,b,g\}; \{f\}\}$ and $C^* =
\{\{c_1\}; \{c_2, c_4, c_5, c_6\}; \{c_3\}\}$. The corresponding crosstable is shown in Table 3
together with the table of the derived conditional frequencies. The latter
have been computed by setting $\lambda = 1$. The marginal uncertainty is $h(C^*) =
1 - .20^2 - .53^2 - .27^2 = .61$ and the uncertainty after the split, which is the
weighted average of the uncertainty of each column, is $h(C^* \mid X_7^*) = .31$. The
gained information is thus .3. This is the maximal value achievable with any of
the 29 attributes.
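
The Step 1 figures can be approximately reproduced from the counts of Table 3 with the sketch below. It rests on two assumptions that the text does not fully specify: the Laplace estimate of a proportion is taken as (count + λ) / (total + λ × number of classes), and the columns are weighted by their raw shares of the 194 cases. Function names are illustrative.

```python
def gini(proportions):
    """Quadratic (Gini) entropy of a probability vector."""
    return 1.0 - sum(p * p for p in proportions)


def laplace(counts, lam=1.0):
    """Assumed Laplace estimate: add lam to every count before normalizing."""
    total = sum(counts) + lam * len(counts)
    return [(c + lam) / total for c in counts]


def gini_gain(table, lam=1.0):
    """Marginal uncertainty, uncertainty after the split, and their difference."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    marginal = gini(laplace(row_totals, lam))
    after = 0.0
    for column in zip(*table):
        weight = sum(column) / n  # assumed: raw share of cases in the column
        after += weight * gini(laplace(column, lam))
    return marginal, after, marginal - after


# Counts of Table 3: rows {c1}, {c2,c4,c5,c6}, {c3};
# columns {c,d,e,h}, {a,b,g}, {f}.
table3 = [
    [33, 6, 0],
    [2, 100, 1],
    [17, 9, 26],
]
marginal, after, gain = gini_gain(table3, lam=1.0)
print(round(marginal, 2), round(after, 2), round(gain, 2))
```

Under these assumptions the three quantities round to the values quoted in the text (.61, .31 and .3, up to rounding).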




Fig. 2. Example of an Arbogodai tree 

Table 4. Crosstable for splitting the two leftmost leaves with X29

First leaf:

                 {a,b,d,e}   {f}
{c1, c2, c4}     38          1
c3               0           3
other            0           0

Second leaf:

                 {a,d,e}     {b,f}
c3               2           2
c4               8           0
other            0           0


Step 2. The process is repeated on every terminal node of the previously obtained
tree. Notice that we try to merge the original sets of values $\mathcal{X}$ and $\mathcal{C}$ and not the
sets of previously merged classes. In our example, the next best split occurs at
the middle node ($X_7 \in \{a, b, g\}$). The attribute selected for splitting this node is
$X_3$. The 6 values of the target C were merged to form 4 target classes. However,
no merging of the attribute values could improve the association between $X_3$ and the
target C. The node is therefore split into 4 new classes corresponding to the 4
values of $X_3$. This leads to the tree with 6 leaves shown in Figure 2.

Following steps. In our example, the tree growing process is stopped after step 2. 
Without explicit stopping rules, the growing continues until the criterion can no 
longer be improved. At step 3, Arbogodai would scan the 6 leaves of the previously 
grown tree. 

Two further remarks should be made: (i) At the same level, nodes that do not
result from the same parent may have different partitions of the set $\mathcal{C}$ of response
values. (ii) When the same attribute is used as the splitting variable at more than
one node, its values are not necessarily partitioned the same way for each split.
For example, growing the tree of Figure 2 one level further leads to splitting each of
the two leftmost leaves of level 2 with the same attribute $X_{29}$. The corresponding
crosstables are given in Table 4. It can be seen that the values of C are once
partitioned as $\{\{c_1, c_2, c_4\}, \{c_3\}, \{c_5, c_6\}\}$ and once as $\{\{c_3\}, \{c_4\}, \{c_1, c_2, c_5, c_6\}\}$.
Likewise, attribute $X_{29}$ is used once with the partition $\{\{a, b, d, e\}, \{f\}\}$ and
once with $\{\{a, e, d\}, \{b, f\}\}$.

5 Induced Rules and Their Accuracy

Arbogodai can generate two types of classification rules: (i) Classical rules, by
disregarding the merged classes of response values in the final leaves. (ii) Multiple
conclusion rules for leaves with merged response values. This section specifies
the nature of these rules, defines error rates adapted for them and presents
experimental results.

We give hereafter the multiple conclusion rules generated by the tree of 
Figure 2. Each path joining the root to a leaf defines the premise of a rule. The 
conclusion is drawn from the distribution in the leaf, i.e. the rule predicts for 
cases falling in the leaf the modal value in the leaf, or the modal class of values 
when some are merged. The tree has 6 leaves giving rise to the following 6 rules 
(the value in parentheses is the confidence of the rule on the training data). 
Clearly, when the majority class contains only one value we get classical rules. 
Here, only R3 and R4 provide multiple conclusions, of the form “either c1 
or c2.” 



R1 : If X7 ∈ {c, h, e, d} then C = c1 (33/52) 

R2 : If X7 = f then C = c3 (26/27) 

R3 : If X7 ∈ {a, b, g} and X3 = a then C ∈ {c1, c2} (34/42) 

R4 : If X7 ∈ {a, b, g} and X3 = b then C ∈ {c3, c4} (12/12) 

R5 : If X7 ∈ {a, b, g} and X3 = c then C = cq (10/16) 

R6 : If X7 ∈ {a, b, g} and X3 = d then C = c5 (31/45) 



5.1 Error Rates 

The accuracy of the learned rules is usually assessed with the misclassification 
error rate or equivalently with the classification success rate. For classical rules, 
the misclassification rate reads 

\[ err = 1 - \sum_{s \in S} f_s \, f_{\max|s} \] 

where $f_s$ is the proportion of cases in leaf s and $f_{\max|s} = \max_i f_{i|s}$ 
is the frequency of the modal response in leaf s. 
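
As a minimal sketch of this formula, the Python snippet below computes err from 
per-leaf class counts. The function name and the counts are illustrative 
assumptions, not taken from the paper. It relies on the fact that the product of 
the leaf proportion and the modal frequency equals the modal count of the leaf 
divided by the total number of cases n. 

```python
from fractions import Fraction

def misclassification_rate(leaf_counts):
    """err = 1 - sum over leaves of f_s * f_max|s.

    leaf_counts: list of dicts mapping a response value to its count in a leaf.
    """
    n = sum(sum(counts.values()) for counts in leaf_counts)
    # f_s * f_max|s = (n_s / n) * (modal count / n_s) = modal count / n
    correct = sum(max(counts.values()) for counts in leaf_counts)
    return 1 - Fraction(correct, n)

# Hypothetical leaves (illustrative counts only).
leaves = [
    {"c1": 8, "c2": 2},
    {"c1": 1, "c2": 4, "c3": 5},
]
err = misclassification_rate(leaves)
print(err, float(err))
```

With these counts the two modal values cover 8 + 5 = 13 of the 20 cases, so the 
rate is 7/20. 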

For multiple conclusion rules, two kinds of error rates can be defined: 

superclass error 

\[ serr = 1 - \sum_{s \in S} f_s \, f_{\max|s} \] 

weighted superclass error 

\[ werr = 1 - \sum_{s \in S} f_s \, f_{\max|s} \sum_{i \in C_{\max,s}} f_{i|\max,s} \, \hat{p}_{i|\max,s} \] 


where $C_{\max,s}$ is the set of response values in the modal superclass at leaf s, 
$f_{i|\max,s}$ the frequency of response $c_i$ in that superclass and 
$\hat{p}_{i|\max,s}$ an estimate of the probability of $c_i$ in the superclass. We 
get resubstitution error rates when the frequencies are those of the learning 
sample and generalization error rates when the frequencies are obtained from 
validation data. The estimates $\hat{p}_{i|\max,s}$ are in any case computed on the 
training data. To get more reliable estimates, we use the marginal distribution 
at the parent node. This can be justified as follows. Values are merged when 
their distributions among the values of the split attribute are similar. Hence, 
their distributions inside the superclass are similar too and therefore similar 
to the marginal distribution. 
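
The two definitions can be illustrated with a short Python sketch. Leaf sizes and 
superclass totals are read from the confidences of the six rules listed earlier 
(33/52, 26/27, 34/42, 12/12, 10/16, 31/45). The split of the 34 and 12 cases 
inside the two merged superclasses (29 + 5 and 7 + 5) is a hypothetical 
assumption, as is the function name. For simplicity the estimates default to the 
observed within-superclass frequencies rather than to the parent-node marginal 
distribution described above. 

```python
def superclass_errors(leaves, p_hat=None):
    """Return (serr, werr).

    leaves: list of (n_s, inside) pairs, where n_s is the leaf size and
            inside lists the counts of each value of the modal superclass.
    p_hat:  optional within-superclass probability estimates, one list per
            leaf; defaults to the observed frequencies f_i|max,s.
    """
    n = sum(n_s for n_s, _ in leaves)
    s_ok = w_ok = 0.0
    for k, (n_s, inside) in enumerate(leaves):
        n_max = sum(inside)                          # cases in the modal superclass
        f_in = [c / n_max for c in inside]           # f_i|max,s
        p_in = f_in if p_hat is None else p_hat[k]   # estimated p_i|max,s
        share = n_max / n                            # f_s * f_max|s
        s_ok += share
        w_ok += share * sum(f * p for f, p in zip(f_in, p_in))
    return 1 - s_ok, 1 - w_ok

# Leaf sizes and superclass totals from the rule confidences;
# the inner splits [29, 5] and [7, 5] are hypothetical.
leaves = [(52, [33]), (27, [26]), (42, [29, 5]),
          (12, [7, 5]), (16, [10]), (45, [31])]
serr, werr = superclass_errors(leaves)
print(round(serr, 4), round(werr, 4))
```

Under these assumptions serr equals 48/194, and werr is larger because the two 
merged leaves are penalized for the uncertainty inside their superclass. 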

The superclass error, serr, is computed as for classical rules but with the 
frequencies of the modal superclasses instead of the single response frequencies 
$f_{i|s}$. Doing so, we disregard classification errors inside the modal 
superclasses. This may make sense for each rule taken independently. We cannot 
compare, however, the error rate of a rule that predicts, for instance, c1 with 
that of a rule that predicts c1 or c2. Hence, the global superclass error does 
not make much sense. 

The weighted superclass error, werr, takes the uncertainty inside the majority 
class into account. It assumes that each case falling in a leaf is randomly 
assigned to a value in the modal superclass. The supposed random assignment 
is done according to the learned distribution inside $C_{\max,s}$. For instance, 
for our example tree, a case (X7 = a, X3 = a, C = c2) is correctly classified in 
the modal superclass of leaf 3. In that leaf, the estimated proportion of cases 
taking C = c1 in the superclass is 85%. Thus, we weight this correct 
classification and count it as a 0.85 correct classification. In resubstitution, 
if we use $\hat{p}_{i|\max,s} = f_{i|\max,s}$, this is equivalent to weighting down 
the success rates with the Gini uncertainty of the distribution inside the 
superclass: 

\[ werr = \sum_{s \in S} f_s \left[ serr_s + (1 - serr_s) \, Gini(C_{\max,s}) \right] \] 

where $serr_s$ is the superclass error for rule s. 
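
The equivalence can be checked in a few lines. The derivation below is a sketch 
under two assumptions that are spelled out in its comments: the per-rule 
superclass error is taken as one minus the modal superclass frequency of the 
leaf, and the Gini index is the usual one minus the sum of squared frequencies 
inside the superclass. 

```latex
% Assumes \hat{p}_{i|\max,s} = f_{i|\max,s} (resubstitution),
% serr_s = 1 - f_{\max|s}, and
% Gini(C_{\max,s}) = 1 - \sum_{i \in C_{\max,s}} f_{i|\max,s}^2.
\begin{align*}
werr &= 1 - \sum_{s \in S} f_s \, f_{\max|s} \sum_{i \in C_{\max,s}} f_{i|\max,s}^2 \\
     &= 1 - \sum_{s \in S} f_s \, (1 - serr_s) \bigl(1 - Gini(C_{\max,s})\bigr) \\
     &= \sum_{s \in S} f_s \Bigl[ 1 - (1 - serr_s) \bigl(1 - Gini(C_{\max,s})\bigr) \Bigr]
        \qquad \text{since } \sum_{s \in S} f_s = 1 \\
     &= \sum_{s \in S} f_s \Bigl[ serr_s + (1 - serr_s) \, Gini(C_{\max,s}) \Bigr].
\end{align*}
```

For a leaf whose modal superclass holds a single value the Gini term vanishes, 
so the leaf contributes the same amount to werr as to serr. 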

It is well known that the learning error rate suffers from an optimistic bias: it 
underestimates the generalization error rate. For validation, it is therefore 
common to compute the classification error rate on a separate dataset not used 
for learning. Alternatively, and perhaps more frequently, a cross-validation 
error rate is computed. A 10-fold cross-validation (10CV), for instance, consists 
in splitting the learning sample into 10 approximately equally sized parts. 
Dropping a different part each time, we get 10 learning datasets from which 10 
trees are induced. For each of them we compute the error rate on the dropped 
data. The cross-validation error rate is the mean of the 10 resulting error rates. 
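
The procedure can be sketched in plain Python as follows. This is only an 
illustration: the function names are hypothetical, the data are a toy example, 
and a majority-class predictor stands in for the tree learner, which is not 
reproduced here. 

```python
import random
from collections import Counter

def k_fold_error(data, fit, k=10, seed=0):
    """Mean held-out error rate over k folds.

    data: list of (x, y) pairs.
    fit:  function mapping a training list to a predictor x -> y.
    """
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]   # k approximately equal parts
    errors = []
    for i, held_out in enumerate(folds):
        # Learn on everything except the dropped part.
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        predict = fit(train)
        wrong = sum(predict(x) != y for x, y in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / k

def fit_majority(train):
    """Stand-in learner: always predicts the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

# Toy data: 70 cases of class "a" and 30 of class "b" (illustrative only).
data = [(i, "a") for i in range(70)] + [(i, "b") for i in range(30)]
cv_err = k_fold_error(data, fit_majority)
print(cv_err)
```

Any learner that returns a predictor can be passed as `fit`. Replacing the 
stand-in with a tree induction routine would give the kind of 10CV estimate 
described above. 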

5.2 Experimentation 

We have tested our approach on 8 benchmark datasets. Table 5 gives 
the cross-validation success rates obtained for each dataset with Arbogodai and, 
for the sake of comparison, with CART and CHAID. For Arbogodai, we give the 
rates derived from both the classical and the weighted superclass error. Arbogodai 
ranks first for 5 of the 8 datasets whichever error is considered. Unsurprisingly, 
its superiority is most marked when the number of values of the target variable 
is large. 



Table 5. Cross-validation classification success rates (in percent) 

Dataset              CART             CHAID            Arbogodai 
                     1-err   stdev    1-err   stdev    1-err   stdev   1-werr  stdev 
Iris (3 cl.)         95.11   0.08     94.81   0.08     98.35   0.11    95.50   0.08 
Flags (6 cl.)        75.14   0.40     75.21   0.40     78.83   0.41    83.37   0.34 
Breast (2 cl.)       97.54   0.17     97.19   0.15     98.17   0.13    98.08   0.17 
Car (4 cl.)          83.47   0.32     93.62   0.23     86.75   0.32    87.81   0.31 
Ionosphere (2 cl.)   92.10   0.19     89.68   0.20     89.34   0.3     93.36   0.25 
Pima (2 cl.)         84.44   0.38     83.55   0.38     81.39   0.38    81.20   0.40 
Wine (3 cl.)         97.71   0.19     97.99   0.19     98.09   0.07    95.21   0.20 
Zoo (7 cl.)          87.57   0.22     85.99   0.26     88.61   0.12    94.04   0.16 


6 Conclusion 

To conclude, we would like to point out that the Arbogodai method is well suited 
for mixed nominal and ordinal multi-valued attributes, since the merging of any 
values or of adjacent values only can be set on the fly. It can likewise handle 
both nominal and ordinal, hence quantitative, target variables. Thus, Arbogodai 
could be seen as some sort of regression tree. The originality is that, unlike 
for instance CART, which generates point predictions for each leaf, Arbogodai 
would provide interval predictions. The multi-conclusion of an Arbogodai rule can 
hence be seen as a generalized interval for qualitative responses. Finally, let 
us mention that we are presently designing further experiments for comparing 
Arbogodai with other tree methods, especially CHAID and CART. This aspect 
requires careful investigation. Indeed, the parameterization of the trees 
(depth, pruning, stopping rules, ...) plays a crucial role in the classification 
performance. We are therefore trying to set up rigorous conditions that would 
ensure fairer, hence more useful, comparison results. We also plan to investigate 
the relationship to the minimum description length (MDL) principle [8], as the 
optimally reduced tables can be seen as theories that best describe, locally at 
each node, the relevant knowledge about the relation between attributes and the 
target variable. 